Pathogen detection using next generation sequencing

ABSTRACT

Embodiments are directed to systems and methods for pathogen detection using next-generation sequencing (NGS) analysis of a sample. Embodiments may apply alignment algorithms (e.g., SNAP and/or RAPSearch alignment algorithms) to align individual sequence reads from a sample in a next-generation sequencing (NGS) dataset against reference genome entries in a classified reference genome database. Embodiments of the present invention may include classifying, filtering, and displaying results to a clinician that can then quickly and easily obtain the results of the sequencing to identify a pathogen or other genetic material in a sample that is being tested. A negative sample and a corresponding database can be used to remove contaminants from a list of candidate pathogens. Thus, embodiments are directed to a system that is configured to filter the results of a sequencing alignment and classify a sample quickly.

CROSS-REFERENCES TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent applicationSer. No. 15/917,286, filed Mar. 9, 2018, which is a continuation ofInternational Patent Application No. PCT/US2016/052912 filed Sep. 21,2016, which claims the benefit of and priority to U.S. ProvisionalApplication No. 62/221,574, filed Sep. 21, 2015, the entire contents ofwhich are herein incorporated by reference in their entirety for allpurposes.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

This invention was made with government support under grant no. HL105704awarded by the National Institutes of Health. The government has certainrights in the invention.

SEQUENCE LISTING INCORPORATION BY REFERENCE

The Sequence Listing written in file Sequence-Listing ST25.txt createdon Jan. 29, 2020, is 784 bytes, machine format IBM-PC, MS-Windowsoperating system, in accordance with 37 C.F.R. §§ 1.821-1.825, is herebyincorporated by reference in its entirety for all purposes.

BACKGROUND

Infectious diseases affect the lives and health of millions of patientsannually. Failure to obtain a laboratory-confirmed diagnosis for manyacute infectious diseases directly contributes to poor patient outcomesand a high cost burden to the health care system. Key areas of unmetclinical need include neurological infections (encephalitis andmeningitis), pulmonary infections (e.g., pneumonia), blood infections,and sepsis.

Traditional diagnostic methods, including culture, antigen detection,and nucleic acid amplification, are limited in scope in cases for whichthere is little clue regarding the identity of the causative agent. DNAsequencing is the process of determining the precise order ofnucleotides within a DNA molecule. It includes any method or technologythat is used to determine the order of the four bases—adenine, guanine,cytosine, and thymine—in a strand of DNA. DNA sequencing can be helpfulin identifying potential causes of disease in patients. For example,alignment processes may be used to identify matching portions of asample sequence with a reference database of classified referencesequences.

However, DNA sequencing of a sample and the alignment of such sequencesfor identification of a potential source of disease, typically includeslong processing times due to the large amount of information to compareand process in order to identify a matching sample sequence andreference sequence. Additionally, because of the vast amount of datawithin sequencing data, sequence alignment results can return largenumbers of false positives where portions of sequence reads may appearto match to portions of a reference genome that is not in fact presentin the biological sample. As such, many alignment results from a DNAsequence alignment process may not be accurate and raw results of suchalignments are not useful as-is in a clinical environment because anexpert and/or clinician must analyze the results and manually interpretthe returned alignment results to interpret the results of thesequencing reads.

Thus, challenges in the field include developing accurate pipelines (orprocesses) that can quickly analyze millions of reads that includemillions of data points that come out of a DNA sequence system, as wellas interpret the data so that it is clinically useful to laboratoryscientists and/or a physician. Accordingly, there is a need for systemsthat are capable of quickly and efficiently identifying and interpretingnext generation sequencing data for detection of potential causes ofdisease and/or any other potential applications of DNA sequencealignment information.

Embodiments of the present invention solve these and other problemsindividually and collectively.

BRIEF SUMMARY

Embodiments are directed to systems and methods for pathogen detectionusing next-generation sequencing (NGS) analysis of a sample. Embodimentsmay apply alignment algorithms (e.g., SNAP and/or RAPSearch alignmentalgorithms) to align individual sequence reads from a sample in anext-generation sequencing (NGS) dataset against reference genomeentries in a classified reference genome database. Various embodimentscan filter, classify, and display results to a clinician to identify apathogen or other genetic material in a sample that is being tested.Embodiments can provide various systems that are configured to filterthe results of a sequencing alignment and classify a sample quickly andaccurately.

As an example, two alignment techniques (one being faster than theother) can be used together to speed up alignment, without sacrificingaccuracy. An initial alignment technique can identify which referencegenomes in a database match to which sequence reads. For a matchingreference genome, an optimally-aligning sequence read can be identified.For the optimally-aligning sequence read, a different alignmenttechnique can be applied, and it can be determined whether any of thenew alignment scores to other reference genomes exceed the optimalalignment score for the matching reference genome. If an new alignmentscore exceeds the optimal alignment score for the optimally-aligningsequence read, the matching reference genome can be removed from a setof matching reference genomes. The set of matching reference genomes canthen be output.

As another example, sequence reads can be assigned to a particularclassification level, so as to provide accuracy identified of aparticular pathogen. Sequence reads can be identified that match to twoor more matching reference genomes of the classified reference genomeswith at least the minimum alignment threshold. For such a sequence read,a taxonomy identifier can be assigned from classification information toeach of the two or more of the classified reference genomes. Thetaxonomy identifier can include at least two levels of classification.The assigned taxonomy identifier of each of the two or more classifiedreference genomes can be compared at each of the at least two levels ofclassification, with levels that do not match being removed. The lowestlevel shared between the two or more reference genomes can be assignedto the sequence read. Updated alignment results can be provided andinclude a number of corresponding sequence reads for each of theplurality of taxonomy identifiers.

As another example, background contaminants can be identified andremoved from a set of potential (candidate) pathogens that areclinically-relevant. A negative control sample can be used to identifysequence reads from potential contaminating organisms. A ratio can betaken of a first amount of sequence reads from a test sample that alignto a matching genome and a second amount of sequence reads from thenegative control sample that align to the matching genome. The ratio canbe compared to a threshold to identify a set of one or more matchingreference genomes have the ratio exceed the threshold. An output canidentify a set of one or more matching reference genomes as potentialpathogens in the test sample.

Other embodiments are directed to systems and computer readable mediaassociated with methods described herein.

A better understanding of the nature and advantages of the presentinvention may be gained with reference to the following detaileddescription and the accompanying drawings. Reference to the remainingportions of the specification, including the drawings and claims, willrealize other features and advantages of the present invention. Furtherfeatures and advantages of the present invention, as well as thestructure and operation of various embodiments of the present invention,are described in detail below with respect to the accompanying drawings.In the drawings, like reference numbers can indicate identical orfunctionally similar elements.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a block diagram of a system 100 for sample analysis,identification, and classification according to embodiments of thepresent invention.

FIG. 2 shows an exemplary method of analyzing and matching sequencereads of a sample to one or more classified reference genome databasesaccording to embodiments of the present invention.

FIG. 3 shows an exemplary method of analyzing and matching sequencereads of a sample to one or more classified reference genome databasesand identifying biological material corresponding to sequence reads ofthe sample according to embodiments of the present invention.

FIG. 4 shows an exemplary method of applying a taxonomic classificationalgorithm to the results of an alignment process to one or moreclassified reference genome databases for sequence reads of a sampleaccording to embodiments of the present invention.

FIG. 5 shows an exemplary method of filtering the results of analignment process to one or more classified reference genome databasesfor sequence reads of a sample according to embodiments of the presentinvention.

FIG. 6 shows an exemplary table including a summary of the results of anexemplary sequencing process for three different sample sequence readsaccording to embodiments of the present invention.

FIG. 7 shows an exemplary result table including matching referencegenomes to three different sample sequence reads as a result of anexemplary sequencing process without filtering or taxonomicclassification processes being applied to the results according toembodiments of the present invention.

FIG. 8 shows an exemplary result table including matching referencegenomes to three different sample sequence reads as a result of anexemplary sequencing process with a filtering process being applied tothe results but no taxonomic classification process being applied to theresults according to embodiments of the present invention.

FIG. 9 shows an exemplary result table including matching referencegenomes to three different sample sequence reads as a result of anexemplary sequencing process with a taxonomic classification processbeing applied to the results but no filtering process being applied tothe results according to embodiments of the present invention.

FIG. 10 shows an exemplary result table including matching referencegenomes to three different sample sequence reads as a result of anexemplary sequencing process with both a filtering process and ataxonomic classification process being applied to the results accordingto embodiments of the present invention.

FIG. 11 is a flowchart of a method for identifying pathogens in a testsample of biological material according to embodiments of the presentinvention.

FIG. 12 shows table of sequencing statistics according to embodiments ofthe present invention.

FIG. 13 shows a table of the number of reads from DNA libraries and RNAlibraries aligning to viral sequences according to embodiments of thepresent invention.

FIG. 14 shows an exemplary computer system.

An Appendix includes Supplementary Tables 1-4.

DETAILED DESCRIPTION

Embodiments can provide processes for rapid analysis of next generationsequencing (NGS) data for pathogen detection. For example, embodimentsmay be used in a broad, comprehensive pathogen diagnostic for infectiousdiseases by analyzing sequencing results against many reference genomes,as a metagenomics analysis. The use of unbiased metagenomicnext-generation sequencing (mNGS) can provide for detection of allpotential pathogens in a single assay. An advantage is the ability todetect all viruses, bacteria, fungi, and parasites in a single,standardized universal test directly from diverse clinical sample typessuch as cerebrospinal fluid (CSF), bronchoalveolar lavage (BAL), andplasma, thereby maximizing the potential impact on patients with acute,life-threatening infections by early and more accurate diagnosis.

Embodiments may apply alignment algorithms (e.g., SNAP and/or RAPSearchalignment algorithms) to align individual sequence reads from a samplein a next-generation sequencing (NGS) dataset against reference genomeentries in a classified reference genome database. For example, as aresult of sequencing of a sample, the system may obtain a list ofresults of reference sequences that align with the sequencing reads of asample. Individual sequences in the reference genome database (alsoreferred to as a GenBank) are referred to as reference genome sequencesand may be identified by genome identifiers (GI). The genome identifiersinclude reference identifiers used to identify reference sequencegenomes stored in the GenBank database. A best GI match can be assignedto each read, and the taxonomy assigned to each GI according to theclassified reference genome in the GenBank can be assigned to each read.

Such unbiased analysis (e.g., using the large number of referencegenomes in the GenBank) can make accurate detection of specificpathogens difficult as many human samples include flora or backgroundcolonization organisms. For instance, a nasal swab has a lot of bacteriain it because that's going to be bacteria that sort of colonize yourrespiratory tract. Thus, when the system makes a detection, it isdifficult to know whether the system has detected a pathogen or acolonizer. Further, because the analysis does not bias or target anyindividual pathogen, the system could be detecting, for instance, abacterial contamination of enzyme preps that are used in making thesequencing libraries. So effectively the system may identify laboratoryreagent contamination instead of matching to a particular infectiousdisease.

Accordingly, while the sequencing analysis and matching is sensitive,the analysis may return false-positive matches (e.g., a human read willalign to a viral GI due to database misannotations). And, referencegenome databases may include many GI entries that are poorly annotated(e.g., “uncultured eukaryote” instead of a particular identification)or, even worse, incorrectly annotated (e.g.,[http://][www].ncbi.nlm.nih.gov/nuccore/KC506764.1, annotated as GBV-Bwhen this is actually GBV-C virus). These databases typically allowanyone to add reference genomes and there is no standard for annotatingsamples.

Further, many of the alignments are not accurate and raw results aloneare not useful as-is clinically because an expert and/or clinician mustanalyze the results and manually interpret the returned alignmentresults to interpret the results of the sequencing reads. Typically,such interpretation required an expert genomicist or bioinformaticist orsomeone who is well-experienced with infectious diseases or laboratorymedicine and who also understands the bioinformatics. Accordingly, suchraw results are not helpful or usable in a clinical laboratory wheresuch processes can provide great value to clinicians caring forpatients.

To address this issue (non-specific identification), embodiments canfilter and classify sequencing reads to obtain more accurate results andavoid false-positives, maximizing the specificity of detection.Embodiments can filter, classify, and display results to a clinicianthat can then quickly and easily obtain the results of the sequencing toidentify a pathogen or other material that is being tested. Filteringcan be helpful because the genome databases may be rife withmisannotations and false positive matches.

Further, typical sequencing methods compare sequences to referencedatabases and then identifying hits based on what the sequence isaligned to. However, the actual interpretation of those hits is moresubtle because an expert actually has to analyze the data to determineif some hits are not accurate and/or do not provide enough informationto be classified as a particular pathogen. Embodiments can improve theclassification of reads by applying a rapid taxonomic classificationalgorithm to alignment results to provide more accurate and clinicallyuseful results. Additionally, embodiments can annotate the data withinthe genome reference database so that the system knows what referencegenomes are a pathogen, a colonizer, what is considered likelycontamination, etc. Embodiments can also provide a user-friendlyvisualization interface that can be used by clinicians to quickly andeasily identify pathogens and other results.

Embodiments can provide a number of advantages including limiting themanual annotation and interpretation of sequencing analysis results toallow quick, efficient, and useful clinical and public healthsurveillance. For example, embodiments may be used where a patient issick and a clinician may take one or more samples from the patient todetermine what pathogens are present in the samples. The clinician maydesire to know whether the patient has a viral infection, a bacterialinfection, a fungal infection, a parasite, etc. Any type of biologicalsample that obtains DNA material could be used to identify potentialcauses of the disease. For instance, blood, cerebrospinal fluid,respiratory secretion, tissue, stool, etc., may be used to obtain asequence read of the samples. Additionally, embodiments may be used inblood bank testing, food and water quality testing, environmentaltesting, animal testing, animal health, or any other area that may beassisted by quickly and efficiently determining potential sequencematches within a sample. However, a sample will likely include a mixtureof multiple organisms. Some of these organisms may be from a person,viral, bacterial, and many samples are mostly mixtures of differentorganisms.

The mNGS assay has been clinically validated for diagnosis ofencephalitis and meningitis in cerebrospinal fluid. The assay canincorporate: (1) the analytic wet bench process including handling ofpatient samples, nucleic acid extraction, mNGS library preparation, andsequence generation on an Illumina HiSeq instrument, and (2)bioinformatics analysis of sequence data and clinical interpretation bytrained microbiologists/pathologists. Results of diagnostic mNGS testingfrom cerebrospinal fluid are reportable in the medical chart and can beused for clinical management. The clinical implementation of the mNGSassay has a direct, positive impact on patient outcomes by increasingthe number and proportion of patients with an accurate, clinicallyactionable infectious disease diagnosis that allows for timelymanagement and treatment. The assay is also able to detect rare,unexpected, or slow growing or uncultivable microorganisms, for whichdiagnosis is often delayed or missed. The mNGS assay may have particularutility in identifying culture-negative pathogens due to prior treatmentwith antibiotics. The mNGS assay may also be useful as a “rule-out” testfor infectious diseases, which may impact management by increasingclinical confidence in working up and treating non-infectious causes ofencephalitis/meningitis such as autoimmune disease with steroids and/orimmunosuppressive medications.

The mNGS data can also yield additional information besides whether ornot a given microorganism or microorganism type is present or absence.The number and proportion of reads, typically expressed as a “reads permillion (RPM)” metric normalized to a negative “no-template” controlsample run in parallel, can provide some degree of quantitative or atleast semi-quantitative information. In some cases, the pathogen genomecoverage may be sufficient to facilitate (1) precise genotyping orstrain identification, (2) analysis of single-nucleotide polymorphismsor mutations, (3) the generation of predicted antibiotic/antiviralresistance profiles.

I. Example Systems

Once a clinical sample has been obtained (e.g., from a patient or froman environmental samples, such as a water sample), the sample may besequenced, and the sequencing results can be analyzed. Various systemsand processes can be used. For example, nucleic acid extraction can beperformed. A cDNA/DNA library preparation may include adding adapters,although other preparation for other types of sequencing may beperformed, e.g., for nanopore sequencing or other single moleculesequencing. The library of templates can be fed into a sequencing deviceto provide sequencing information, e.g., base calls or raw signals thatare used to determine base calls, thereby obtaining sequence reads. Theanalysis of the sequence reads can include host subtraction; adapter,quality, and low-complexity trimming; and alignment against referencedatabases such as NCBI (National Center for Biotechnology Information)GenBank.

FIG. 1 shows a block diagram of a system 100 for sample analysis,identification, and classification according to embodiments of thepresent invention. As shown, system 100 includes a sample analysissystem 110 and a sequence identification computer 120. Sample analysissystem 110 may be communicatively connected to the sequenceidentification computer 120 through a communications network 140, whichmay be any suitable communications network for communicating data. Asanother example, data may be communicated via a removable storagedevice, such as a USB drive.

The sample analysis system 110 may be configured to receive a sample 150(e.g., after library preparation) from a clinician or other operator,sequence the sample to obtain a plurality of sequence reads of geneticmaterial for the sample, and submit the plurality of sequence reads tosequence identification computer 120. A sequencing device 115 cancorrespond to any sequencing device, such as those produced by Illumina,Pacific Biosciences, or Oxford Nanopore. A processor 111 (e.g., a CPU)can control aspects of sequencing device 115, such as one or morecameras for taking images of the nucleic acids or electrical components,both receiving sequencing signals corresponding to nucleotides of thenucleic acids being sequenced.

A memory 112 (e.g., flash memory, hard drive, DRAM, cache, etc.) canstore software for controlling processor 111, which can controlsequencing device 115. In some embodiments, a sample collection module113 can control robotic processes for obtaining a sample (e.g., via asyringe connected to a robotic arm), and perform any automatedpreparation processes. A sample sequencing module 114 can instructprocessor 111 to perform the sequencing, using sequencing device 115.Sample analysis system 110 can perform analysis of the raw sequencingsignals (e.g., fluorescent or electrical signals) to identify basecallsof sequence reads, or send such raw sequencing signals to sequenceidentification computer 120.

Sequence identification computer 120 can process, align, and identifygenetic material that is present in the sequence reads to identifypathogens and/or other genetic material that is present in the sample.An alignment module 123 in memory 122 can instruct processor 121 toalign the sequence reads to a plurality of reference genomes, e.g., asstored in reference genome database 130. One or more alignmenttechniques (e.g., a local aligner, such as SNAP) can be used to obtainalignment results, e.g., initial alignments results, as well assubsequent alignment results. A classification module 124 can classifythe alignment results to include an accurate taxonomic classificationlevel. A filtering module can filter the alignment results to removefalse positives. Sequence identification computer 120 can return resultsfor identifying pathogens and/or other genetic material present in thesample.

Reference genome database 130 can contains a plurality of referencegenomic sequences that have been identified and classified as beingassociated with a particular biological organism. For example, referencesequences of different viruses, bacteria, fungi, human, animal, and/orany other reference DNA sequences of any other biological material maybe stored in the classified reference genome database. Sequenceidentification computer 120 may apply one or more alignment techniquesto align the received sequence reads from the sample to the classifiedreference genome sequences stored within reference genome database 130.

Although FIG. 1 shows the modules and describes the functionality of themodules as being completed by the two systems, in some embodiments, allor a portion of the functionality may be split differently and/or couldbe completed by a single device. For example, in some embodiments,sample analysis system 110 of FIG. 1 may be coupled to a referencegenome database and/or the reference genome database may be stored aspart of sample analysis system 110, and sample analysis system 110 maycomplete all of the functionality described herein.

System 100 can be used to identify and detect pathogens in clinicalsamples, e.g., for the purposes of surveillance, such as epidemiologicsurveillance. For example, it may be desired to look at cases ofsequences generated from patients with acute diarrheal illness or feverto identify and detect pathogens. Surveillance can also be performed todetect novel pathogens, thereby allowing pathogen discovery, e.g., whenno match is found in reference genome database 130. Another applicationis use as a diagnostic tool, e.g., by identifying a known pathogen forwhich treatment is known.

Metagenomic sequencing can include alignment to multiple referencegenomes, as may be stored in reference genome database 130. One exampleof such a database is GenBank, which has more than 20 gigabases in size.With such a large size, the alignment of aligning millions of sequencereads can provide many hits (matching alignments) to different genomes,which is why classification and/or filtering can be useful. There mayactually be a pathogen in that sample that is identified by a hit, butthere can be false hits, as may occur due to problems in the databases.Databases themselves are not well curated, and there are incorrect orerrors in how they are constructed. For example, there may be sequencesannotated as hepatitis C virus but may actually be a human sequence forinstance and vice versa. The database itself can be cleaned up, as well,e.g., when misannotations are identified, as may be done using BLAST.

Additionally, a sample can include a large amount of background nucleicacids. Techniques can be used to filter out such background sequences,as may be done with a no template control (NTC) sample. NTC database 135(also called a contaminant database) can store background sequences thathave been identified in recent samples, e.g., samples within the lastmonth. Levels (e.g., number of reads) that a background sequence hasbeen quantified can be stored, as is described below.

For the taxonomic classification, individual sequences can be classifiedaccording to where they fit on the taxonomic level. For instance, if youhave sequences that are aligned to certain regions (e.g., viral orbacterial genome) those individual reads may not be specific for thatspecies. That is, it may not be specific to that viral genome of a viralspecies. In such a case, the read can be classified using a least commonancestor algorithm to the next higher taxonomic level. Once the sequencereads are classified, the system can then point out what species arespecific to the sample, thereby allowing a good prediction that anactual viral species is in the sample.

In some embodiments, the classification can use the results of aninitial alignment technique, e.g., SNAP, thereby allowing theclassification to start while alignment is being performed for othersequence reads. The efficient alignment can relatively concurrentclassification can allow a taxonomic classification in one to two hours.Accordingly, if the alignment results provide 100 hits, theclassification can narrow to about 10 to 15. However, for clinicalpurposes, it is desirable to have only a few hits, e.g., two. The 10-15hits may be real, but may not be clinically significant. For example,the hits may include microorganisms that are part of the laboratorybackground contaminants or part of the skin background as part ofdrawing blood. Techniques to filter out such contaminants are describedin more detail below, as well as alignment and classificationtechniques.

II. Data Analysis

Once the system has obtained the raw sequence reads from the sequencingmodule, the raw sequence reads are analyzed to obtain an identificationof matching reference sequences from one or more classified referencegenome databases. Embodiments can provide data with respect to whatsequences from different microbial pathogens are detected.

A. Example Pipeline

FIG. 2 shows an exemplary analysis process 200 for processing thesequence reads, identifying matching reference genome sequences, anddetermining matching viral and bacterial matches for the sample sequencereads. Process 200 may be performed with any suitable computer systemthat is connected to a sequencing machine, e.g., via a network or aremovable storage device, such as a USB drive.

At step 210, the raw sequence reads arc received from the sequencingmodule. The sequencing process may return a large number (e.g., 1, 2, 3,5, 10, 20, 50, or 100 million reads, or more) of DNA sequence reads froma sample. As examples, the raw sequence reads may be received in fasta,fastq, sam, or bam files. The sequence reads can be obtained in variousways, e.g., from a sequencing of nucleic acids or probe-based methods,such as hybridization arrays. The use of random sequencing thatessentially sequences all nucleic acids can be preferable so that asignificant number of pathogens (e.g., as many as reference genomes areavailable) can be detected.

At step 220, the raw sequence reads are preprocessed to trim the readsto remove low quality sequences, remove low complexity sequences (e.g.,as would not be very informative), remove adaptors that can be retainedat the ends, etc. Thus, the reads may be cleaned-up to ensure that theremaining reads are of high quality. If adaptors are used, the adaptors(adapters) may be ligated to the end of the nucleic acids and used asprimers for sequencing, and thus may not relate to the actual nucleicacids. The low complexity reads (e.g., with many repeats) can bedifficult to align, and thus removed. The quality of a read can bedetermined based on the quality of the basecalls, which may bedetermined based on the sequencing signals.

At step 230, an alignment module (e.g., alignment module 123) performs asequence alignment analysis for the sample sequence reads to thereference genomes. In some embodiments, the alignment analysis may beperformed in reference to a host genome database first and any samplesequence reads that align to the host may be removed from the sequence.For example, in embodiments that are identifying pathogens associatedwith a human sample, the sequence reads may be aligned with a referencehuman genome database and any matches may be removed (since no pathogensshould be present in the human reference genome database). Comparisoncan be made to multiple reference human genomes.

Thus, embodiments can use computational subtraction to speed up theidentification processes to remove any sequence reads that align withthe host (and thus are not helpful for identifying biological materialfrom other entities (i.e., pathogens)). In embodiments where a singlelarge reference database is used, any results that are identified asmatching with a human reference genome could be removed. However, byapplying the full 100% of the sequence reads and not subtracting thembefore aligning with the full reference database, the process may beslower than if those sequence reads were removed before comparing to thelarge database of reference genomes.

The host database (e.g., human genome database) may include fewergenomes than the entire reference genome database. Accordingly,sequencing can be faster than comparing to all possible referencegenomes. Thus, for relatively high quality sterile samples, applyingsuch a host filtering method may remove most of the background readsquickly. For example, for a good sample, the removal of the host matchescould take the analysis from potentially 100% of the reads down to maybeless than 10% of the reads. Thus, 90% of the sample sequence reads canbe removed by alignment to a host (e.g., human) database.

Additionally, in some embodiments, a host database could includesimilarly related genomes that are not specifically from the host. Forexample, for the human example provided above, the system could alsoinclude primates that have similar genomic references to humans. Thisprovides a more comprehensive host analysis and provides even faster andmore effective subtraction of host matches.

Next, depending on the mode of the identification process, either acomprehensive mode or a fast mode can be performed to identify one ormore matching reference genomes associated with the sample sequencereads. The fast analysis may only align the sample reads to a referencegenome database of bacterial and viral reference genomes. Thus, the fastmode analysis may not identify all potential genomic matches in thesample sequence reads but will focus on the analysis on potentialbacterial and viral matches due to the focus of the process onidentifying pathogens within the samples. The comprehensive mode mayalign the sample reads to an entire nucleotide classified referencegenome database that includes reference genomes from bacterial 240A,fungal 240B, parasitic 240C, viral 240D, and other 240E referencegenomes.

At step 240 of the comprehensive mode, the alignment module performs asequence alignment analysis for the sample sequence reads to thereference genome database, e.g., to reference genomes 240A-240E, Steps241 and 242 of the fast mode may only perform alignment to the bacterialdatabase 240A and viral database 240E, respectively. Any number ofdifferent sequencing algorithms may be used in embodiments of thepresent invention. For example, the “Scalable Nucleotide AlignmentProgram (SNAP)” algorithm includes a nucleotide aligner that takes rawsequence data and aligns it to nucleotide reference databases. SNAP isextremely fast and by using fast sequencing algorithms, the analysisprocesses of the present invention may return results extremely fast.While other analysis methods (i.e., “pipelines”) may analyze a samplesequence data within days, weeks, or months, embodiments of the presentinvention can analyze the sample sequence data in minutes to hours. Fastsequencing analysis is critical in a number of applications including,for example, infectious diseases analysis that can be paired with a nextgeneration sequencing essay that can diagnose infectious diseases andget results to physicians regarding patient care within 8-12 hours, oras soon as possible

For example, in some embodiments, the system may align the sequencereads from the sample to all classified reference genomes in GenBankusing a SNAP alignment algorithm. However, the GenBank is growing veryfast (e.g., doubling every year) so it can be important to limit thenumber of reads being analyzed to improve the speed of the alignmentanalysis with the reference database (e.g., GenBank). For example, insome embodiments, a specific subset of the GenBank may be used toimprove analysis speeds further. For example, if the clinician is notconcerned about potential plant matches, such reference genomes may notbe applied to the alignment analysis and/or a separate subset ordifferent database may be used to avoid potential plant genetic matches.Accordingly, some embodiments may align to a tailored search forpotential pathogens, bacteria, etc. and avoid aligning to referencesequences that are not a part of the tailored search.

At step 250, in the comprehensive mode, de novo contig assembly may beperformed to obtain contigs associated with the results of theassignment from the classified reference genome database. Exampleassembly software include ABySS and Minimo.

At step 260, in the comprehensive mode, the reads and the contigs areused to align the translated nucleotides using another alignmentalgorithm (e.g., RAPSearch) and compared to a viral protein referencedatabase. To generate translated nucleotides, the sequencing reals aretranslated in all 6 reading frames and the resulting amino acidsequences are then compared to a protein reference database. The outputof RAPsearch is similar to that of BLAST but lists alignments thatexceed a predesignated E-value significance threshold.

At step 270, the results of the alignment algorithms are obtained, andmay include a summary table 270A that provides the matching viral,bacterial, fungal, parasitic, and other reference genomes that havealigned to the sample sequence reads. Thus, in response to thealignment, the system can provide a table that shows alignments tobacterial and viral reference genomes. Summary table 270A may list allthe hits to the different species in the sample and the different generain the different families of organisms found in the sample, e.g.,according to taxonomic classification 270C. For example, the system maydivide the results into viruses, bacteria, non-cordate eukaryotes(basically fungi and parasites that do not have a backbone), human, aswell as an “other” category. Cordates are all higher level eukaryotes,eukaryotic organisms that have a backbone. Eukaryotic organisms that donot have a backbone are more often microorganisms. These categories areused in order for the system to identify fungi and parasites as well asother microorganisms that do not have backbone (e.g., invertebrates likeworms). The system may find matches that are not necessarilymicroorganisms, for instance, the system may be capable of diagnosing atape worm infection.

Thus, at the end of the alignment process and in an initial table, thesystem may provide matches for different bacteria and may include thenumber hits for each type of bacteria, virus, etc. Furthermore, in someembodiments, de novo assembly may be applied to actually recreate thegenome and coverage maps 270B of the genome may be provide to indicatehow well the reads cover the genome of one the genomes in the list.

Accordingly, the results of the alignment process of FIG. 2 can provideall possible matches for one or more DNA sequence reads from a sample.For example, summary table 270A may include fifty bacteria and/or twentyviruses. Next the question becomes which one of these or are all ofthese causing the potential infection? Some of these results may becontaminates, some might be colonizers, etc., so the results must beinterpreted to identify a potential candidate as the pathogen that iscausing an infection. Traditionally, if a system detects three differentbacteria, a clinician has to interpret the matches to determine whichone of those three bacteria or which combination of bacteria are themost important in terms of causing the disease. Accordingly, it isimportant to be able to limit as many potential false-positives andincorrect matches as possible to assist clinicians and users of thesystem to identify the most likely one or more pathogens responsible foran illness.

Accordingly, embodiments of the present invention provide (1) filtering,(2) taxonomic classification, and (3) best match algorithms foridentifying and providing the most useful data to a clinician.Additionally, embodiments provide additional features including RNAsequence removal and on-the-fly annotation of database entries tofurther assist interpretation of alignment results.

B. Flowchart of Example Data Analysis

FIG. 3 shows a streamlined process 300 for identify, matching, andinterpreting sample sequence read alignments with classified referencegenome databases according to embodiments of the present invention.Aspects of process 300 may be performed in a similar manner as process200. Process 300 may be performed using system 100.

At step 310, a system may receive a sample from a patient and/or otherbiological item and/or entity. The sample may be provided by a clinicianor other operator of the sample analysis system. The sample may formnucleic acids that have already been prepared into a library forsequencing.

At step 320, the system may obtain sequence reads of nucleic acids fromthe sample. Any suitable method for obtaining DNA sequence reads may beused, e.g., as described herein.

At step 330, the system may preprocess the sequence reads. Thepreprocessing can include trimming sequence reads and removing somereads, e.g., low quality or low complexity reads.

At step 340, the system may apply an alignment algorithm to thepreprocessed sequence reads and obtain alignment results for thesequences reads. Each read can be aligned to a classified referencegenome database, e.g., including millions of reference genome sequences.The alignment may return millions of “hits” or alignments to theclassified reference genome database.

Some of the matches may not be perfect and may include sequence readsthat match a portion of a reference genome sequence. A quality value maybe returned with the alignment results that indicates a measurementand/or magnitude of similarity between the sequence read and thereference genome sequence. Additionally, individual reads may align withmore than one reference genome. Accordingly, alignment results mayinclude an identifier of the reference genome that was matched,classification information associated with the annotated informationprovided when a reference genome was uploaded to the classifiedreference genome database, and a similarity measurement indicating thequality and/or closeness of the alignment between the sequence read andthe reference genome.

At step 350, the system may provide a taxonomic re-classification of thealignment results for reads that have aligned to multiple referencegenome sequences with different taxonomic classifications. For example,the system may compare levels of classification between multiple alignedreference genomes for a particular sequence read and may remove one ormore classification levels that do not match between the referencegenome sequences aligned with a particular sequence read. Additionaldetails and steps associated with the rapid taxonomic classificationprocess is provided in reference to FIG. 4 below.

At step 360, the system may filter the alignment results to remove falsepositives and mis-annotated or mis-classified reference genome sequencesthat aligned to the sequence reads of the sample. For example, thesystem may select a best match sequence read for an identified referencegenome and may apply a second alignment technique to the best matchingsequence read to identify if the same reference genome aligned with thesecond independent alignment. If the alignment of the sequence read doesnot match the same reference genome, the read and the reference genomemay be removed from the alignment results as the reference genome islikely a false positive hit. Additional details and steps associatedwith the filtering steps are provided in reference to FIG. 5 below.

In some embodiments, the filtering may include removal of a referencegenome as a viable pathogen when the taxonomic classification is listedin the results of a no template control (NTC) for the current experimentor a database of previous NTC results. Further details are provided in asection below.

At step 370, the system can use the updated alignment results includingthe filtered and taxonomically re-classified sequence reads to identifyone or more best matches for the sequence reads from the sample. Forexample, the best match may be selected based on the reference genomesequence that aligned with the most sequence reads.

At step 380, the system provides the results of the best matchinganalysis to the analysis system being operated by the clinician. Theresults may be provided via a network connection. The results mayinclude the most probable one or more pathogens that may be responsiblefor a patient's illness.

At step 390, the analysis system provides and/or displays the results tothe clinician. The results can be provided in any variety of ways, e.g.,visually or by audio. The results can indicate a treatment to beprovided, thereby providing a therapeutic intervention.

C. Example Best Match (GI Picker)

An objective of embodiments can be to pick the most informative one ormore genome identifiers (GI) out of the GIs matched by all the samplereads of a single taxonomic assignment, a particular species. As part ofthe analysis, a score can be determined for each of the matchedreference sequences, specifically each GI. In some implementations, thescores are determined for only GIs at a given taxonomic level, e.g.,species. In other implementations, the total scores are determined forGIs across various (e.g., all relevant) taxonomic levels, with a topscore being picked across all the GIs. The total scores can bedetermined from respective scores of various properties.

In some embodiments, each of the matched reference sequences can bescored as to any or all of length, coverage, identity, and percentidentity. Thus, four scores can be determined and used to determine atotal score. Examples of how the scores can be determined are providedbelow.

Length is the length of the reference sequence. Thus, a length score islarger for longer reference sequences. Therefore, a species ofmicroorganism that has a larger genome would have a longer length score.In some embodiments, matches can be ranked by whether or not completeversus partial genomes are available in the database, where use of onlya partial genome would affect the length score.

For each reference sequence, a coverage score can be determined. In oneembodiment, the coverage score can be determined as a cumulative scorerepresenting a sum of the counts of aligned reads at each genomicposition (e.g., base position) along the reference sequence. In anotherembodiment, the coverage score can be determined using a sum of genomicpositions that have at least one read aligned to that position, e.g.,regardless if there is a match that position. The coverage score may bea percent of the genome that is “covered” by at least one aligned read,and thus the sum of genomic positions that have at least one readaligned to that position can be divided by the total number of positionsin the database for that reference genome.

For reference sequence, the identity score can be determined as acumulative score representing a sum of fractional scores at each genomicposition. A fractional score can be calculated as the number of alignedreads with a nucleotide matching the reference sequence divided by thenumber of aligned reads whether or not matching the reference sequences(i.e., the total number of aligned reads covering the position). Thus, agenomic position will have a larger fractional score when there arefewer mismatches of reads at that position.

Percent identity can be calculated by dividing the identity score by thecoverage score.

Once the individual scores are determined, an overall score can becalculated for each reference sequence (taxonomic identifier). In oneembodiment, the overall score can be determined by adding ‘length’ and‘identity’ and multiplying the sum by ‘percent identity’. The referencesequences can be ranked by the overall score, e.g., in descending order.And, a top score(s) can be selected, e.g., by choosing the firstreference sequence in the list. In some embodiments, more than onereference sequence can be selected, e.g., all having a total score abovea score threshold. In other embodiments, the top N or X % of taxonomicidentifiers can be identified.

Accordingly, for each sample, a score can be determined for eachrelevant GI (e.g., a set of taxonomy identifiers) at one or moretaxonomic levels (e.g., for each species, genus, family, etc.). Thescores may be determined only at one taxonomic level, or at a givenlevel and all GIs at lower levels. As an example for each GI, the readsaligned to that GI can be grouped. For each GI, the coverage andidentity score can be determined in the following manner. For eachalignment of a read, the following calculation can be performed for eachalignment position: map the read to the genomic position of the GI, theposition coverage score is increased, and the position identity score isincreased if there is a match. Across all genomic positions for the GI,the total coverage score can be increased by one for each position thathas a position coverage score greater than zero. The total identityscore can be determined as the sum of the fractional scores, computed asthe position identity score divided by the position coverage score. Thepercent identity score can be determined as the total identity scoredivided by the total coverage score.

In some embodiments, the following pseudo code can be used:

for each sample

-   -   for each species        -   group by GI all reads and alignment data        -   for each GI            -   score covered and identity                -   for each alignment                -    for each alignment position                -    map to GI position                -    covs[position]+=1 if match/mismatch                -    idents[position]+=1 if match                -   sum across all GI positions                -    covered+=1 if covs[position]>0                -    identity+=idents[position]/covs[position]            -   score percent identity=identity/covered        -   sort descending by (length+identity)*percent identity        -   choose first in sorted list

D. Mapping Reads to Selected GI

In various embodiments, an output (e.g., a file) can contain the pickedreference sequence (GI) and a consensus sequence of the reads mapped toit. The term “coverage” can be used in two contexts: (1) one of thesub-metrics that make up the overall score for picking a GI; and (2) anoverall metric, inferred from a coverage map file, for howcomprehensively a set of mapped reads cover the picked GI.

For picking a GI, a particular species can have multiple genomicidentifiers for different reference sequences in a database. Thedatabase of reference sequences used can vary or be constructed fromsome combination of separate databases. On database is the NCBI ‘nt’database. It is highly redundant, and often mis-annotated. A singlespecies can have thousands of entries, or GIs. A GI can be picked usingembodiments of the best match algorithm above.

After picking a GI for each species, scores can be determined for eachof the selected GIs (e.g., one for each species). The scores can bedetermined using all reads assigned to the species level, and caninclude reads assigned to higher levels. The scores can be determined asdescribed above, and scores exceeding a threshold, e.g., an absolutenumber, a value at which the top N scores are above (where N is aninteger), or a value at which the top X % of scores are above.

Assigned reads at the subspecies, species, genus, or family level canalso be mapped to the selected GI for visualization. Mapping of reads athigher taxonomic levels (genus or family) will increase the coverage ofthe GI (the percentage of nucleotides that have at least one read mappedto it), but can also incur a risk of erroneous mapping from differentspecies or genera in the sample. Real-time visualization of the coveragemaps and pairwise identity plots can facilitate expert interpretationfor clinical results reporting.

III. Rapid Taxonomic Classification

Some embodiments may analyze clinical or environmental metagenomic data(e.g., metadata) to classify the origin of each of the millions ofnext-generation sequencing reads (e.g. bacterial, viral, human, etc.)that are returned by the alignment results. However, individuals readsmay be informative only at a given taxonomic level (species, genus,family, etc.). For instance, a read in the conserved matrix gene ofinfluenza viruses may be identical (or nearly identical) in influenza Aand influenza B, and thus this read cannot be used to distinguishbetween influenza A and B. This is problematic if one is interested inspecies or strain level identification, which is critically important ininfectious disease diagnosis (examples: Bacillus anthracia/anthraxversus Bacillus cereus; enterovirus versus rhinovirus; Ebola Zaireversus Ebola Reston).

As such, some embodiments may provide a taxonomic classificationalgorithm that removes some of the hits from consideration through themetagenomic data provided through the classification of the referencegenomes. The taxonomic classification algorithm may apply a least commonancestor (LCA) approach where taxonomic identifiers for each of theassociated reference genomes are analyzed to identify the lowest levelof shared ancestry between two or more reference genomes. For instance,if a sequence has hits to both a human and a virus, and the match isconserved between a human and a virus, then the algorithm would move theresult up to the proper taxonomic level, which would in this case beabove the kingdom level. In such embodiments, the system would assumethat it mapped to the human and would remove the result since thetaxonomic classification moved up to the kingdom level.

As another example, if the system returned hits for a single virus butdifferent versions of the virus (e.g., influenza A and influenza B), andthe hits are indistinguishable between influenza A and influenza B, thenthose sequences would be assigned to the influenza genus, and not to thespecies level (i.e., Influenza A and B). Thus, the system would move theclassification for this reference genome up to a higher level (e.g.,family enterovirus instead of mapping to two different enterovirusspecies—influenza A and influenza B).

Such classification allows a clinician to look at the results atdifferent levels and provide potential options for treatment. Using theinfluenza A or influenza B example above, the result may indicate thatthe system matched either influenza A or influenza B instead ofinfluenza A. For example, if the system is analyzing an influenza Asample (i.e., a sample taken from a patient infected with influenza A),the sequence results will likely return that 90% of the samplesequencing reads are going to align influenza A and 10% are going toalign to influenza B because of the indistinguishable portions of thegenetic material between influenza A and B. Reads may be from regionsthat are highly conserved between these two species of influenza. Thus,a clinician may not know, looking at the raw data, whether this patientis infected with both influenza A and B, is infected with influenza Aalone and it just happens to be that B was misaligned to influenza B, orvice-versa. However, by applying the taxonomic classifier and byproperly classifying those reads that were initially assigned toinfluenza B, the system moves up the classification to the genus level,and the remaining reads assigned to influenza A would allow a clinicianto make a determination that the patient has an influenza A infectioninstead of the erroneous call, which would be this a dual infection withinfluenza A and B.

Thus, after shared reference genomes have been classified at theappropriate level, the remaining reads in the alignment results shouldbe specific at each taxonomic level and thus, the remaining hits thatare species-specific are known to be particular to that species. Byremoving the alignment results that are shared between multiplereference genomes sharing a genus, family, etc., the results will bemore accurate and usable by the clinician. The reads that remain aftershared reference genomes of the same genus have been reassigned to agenus as opposed to a particular species associated with the alignedreference genome. Identifying species-specific reads can be extremelyinformative to clinicians where the clinicians know that the read isspecific to the particular species, sub-species, strains, etc. It canalso be useful to know the higher taxonomic level information includingthe number of reads at a particular genus, family, etc. Thus,embodiments provide more informative and accurate information for aclinician.

A. Classification Method

FIG. 4 shows an exemplary process 400 of taxonomic classification forfind the appropriate taxonomic classification level for reads andaligned reference genomes that may share portions of genomic sequencesbetween various species, genus, family, subspecies, strain, lineage,etc. Thus, embodiments allow for appropriate identification of possiblereads aligned to a reference genome to allow the system to provide thebest possible identification of potential pathogens causing illnessand/or present in a sample from a host organism. Process 400 may beperformed by system 100.

At step 410, the system receives the alignment results from applying analignment algorithm to the sequence reads from the sample. In someembodiments, the alignment results may have been previously filteredusing techniques described herein. The rapid taxonomic classificationprocess may be performed concurrently with the alignment process andthus, before the filtering process has been completed. Thus, the systemcan receive a plurality of sequence reads obtained from a sequencing ofDNA molecules from the sample of biological material where the sampleincludes DNA molecules from a plurality of organisms and aligns thesequence reads using an alignment technique to align the plurality ofsequence reads to a plurality of classified reference genomes in adatabase.

The system may obtain initial alignment results that include, for eachof at least a portion of the sequence reads, a matching reference genometo which the sequence read aligns. The initial alignment results mayfurther include classification information for each of the matchingreference genomes where the classification information includes ataxonomic identifier including multiple classification levels for eachreference genome. For example, a reference genome (e.g., enterovirus D)may include a species (e.g., enterovirus D), germs (e.g., enterovirus),and family (e.g., picornaviridae) classification level within ataxonomic identifier. Many other classification levels may be also beassigned and included in the taxonomic identifier and correspondingclassification information with the initial alignment results for eachaligned reference genome.

At step 420, the system identifies a set of the sequence reads thataligned to two or more of the classified reference genomes. For example,in FIG. 7, the human rhinovirus and the human enterovirus K1105 bothhave a single hit. However, as can be seen in results table 900 of FIG.9, after classifying the initial results, both of these referencegenomes have been re-classified.

Accordingly, these reference genomes matched with a read that aligned tomultiple reference genomes. Step 420 can ensure that sequence reads thatalign to only one reference genome are not subjected to unnecessaryfurther analysis, as those sequence reads can be classified based on thesingle matching reference genome.

Steps 430-490 can be performed for each of the sequence reads identifiedin step 420.

At step 430, the system identifies two or more matching referencegenomes of the classified reference genomes to which the sequence readaligns with at least the minimum alignment threshold. For example, usingthe example provided above in reference to FIG. 7 and assuming that asequence read matched to both of these reference genomes, the system mayselect a read that aligned to both of these species-specific referencegenomes. The minimum alignment threshold can ensure that a matchingreference genome is sufficiently similar to the sequence read at a givenlocation. The minimum alignment threshold may be used for all sequencereads, or can vary for different sequence reads, e.g., based on thespecific alignment quality for each of the matching reference genomes oras a result of taking the highest N matching reference genomes (e.g., asdetermined by an alignment score, such as a number of mismatches).

At step 440, the system selects a set of reference genomes for asequence read that matches to multiple reference genomes. For example,using the example provided above, the system may select the humanrhinovirus and the human enterovirus K1105 reference genomes that mayhave been associated with the same read. The selected set of matchingreference genomes may correspond to those identified in step 430. Inother embodiments, other factors may be used to determine the selectionof the set of reference genomes to use. Such other factors can includethe magnitude of the differences between the “best” and “second-best”alignments, the genomic sequence diversity corresponding to the detectedmicroorganism (e.g. viral genomes are most diverse than bacterialgenomes), and the completeness of the available reference databases,including the availability of “near-neighbor reference genomes (i.e.genomes that are closely matched in the reference database).

At step 450, the system assigns a taxonomy identifier from theclassification information to each of the two or more of the classifiedreference genomes. The taxonomy identifier may include at least twolevels of classification (e.g., species, genus, family, etc.) and thelevels may have a hierarchy such that there is a lower level (e.g.,species) and at least one higher level (e.g., genus, family, etc.). Forexample, using the example of FIG. 7, the system may assign thetaxonomic identifier including a species of human rhinovirus species, agenus of enterovirus, and a family of picornaviridae to the firstreference genome and a species of human enterovirus K1105_171204, agenus of enterovirus, and a family of picornaviridae to the secondreference genome.

At step 460, the system compares the assigned taxonomy identifier ofeach of the two or more classified reference genomes at each of the atleast two levels of classification. For example, using the example ofFIG. 7, the system may compare the species, genus, and family of thehuman rhinovirus sp and the human enterovirus K1105_171204 referencegenomes. The system may begin at the lowest level (e.g., species orsub-species level) and compare whether they are the same.

At step 470, the system removes each level of the at least two levelsfrom the assigned taxonomy identifier that do not match between the twoor more classified reference genomes. For example, using the using theexample of FIG. 7, the system may determine that the sub-species andspecies between the human rhinovirus sp and the human enterovirusK1105_171204 reference genomes are not same and may remove those levelsfrom the results. Thus, if differences are present between the levels ofclassification assigned to each of the multiple reference genomes, thelower levels of classification between the multiple reference genomesmay be removed until all the levels match.

At step 480, the system may assign to the sequence read the lowest levelof the at least two levels of the assigned taxonomy identifiers that isshared between the two or more classified reference genomes. Forexample, using the example of FIG. 7, the system may determine that thelowest shared level that is shared is the genus level of enterovirus forthe human rhinovirus sp and the human enterovirus K1105_171204 referencegenomes. Accordingly, the system may assign the lowest level of thetaxonomic identifier for the two reference genomes to be the genus levelof enterovirus for the two reference genomes. In such a case, theidentifier for the genus level would be assigned to the sequence read.If none of the at least two levels of the taxonomy identifier match, anunassigned state may be associated with the sequence read.

As shown in FIG. 9, the species results for both the human rhinovirus spand the human enterovirus K1105_171204 reference genomes have beenremoved from the results table, and instead the reads have been combinedinto a genus match for the enterovirus that has a read of 10,461sequence read hits. Accordingly, the results indicate that 10,461sequence reads aligned with reference genomes that were shared betweenvarious species within the enterovirus genus, but that were not specificto a particular species. Accordingly, the results are simplified for aclinician.

At step 490, the system determines whether all of the reads matchingmultiple reference genomes sequences have had taxonomic classificationprocessing applied. If there are additional reads matching multiplereference genomes to be analyzed, the system may repeat the process forthe next sequence read in the initial alignment results (and may returnto step 430). This process may continue until all of the sequence readsmatching to two or more reference genome sequences have been analyzed toensure the reference genomes have the appropriate level ofclassification.

At step 495, the system provides an identification of one or moretaxonomy identifiers corresponding to one or more candidate pathogensbased on numbers of corresponding sequence reads assigned to each of theplurality of taxonomy identifiers. The identification can be thetaxonomy identifiers with or without additional information. Forexample, the number of reads assigned to each of the identifiers may ormay not be included. The fact that the identifier(s) are being providedcan indicate that the identifier(s) are candidate pathogens, i.e., asufficiently high likelihood of existing in the sample based on thenumber of reads assigned. Various criteria can be used to determinewhether a pathogen is identified as a candidate pathogen, e.g., usingbest match algorithm for picking one or more identifiers (as describedherein) and criteria for determining whether the best matched taxonomyidentifier provides a sufficient match to be identified as a candidatepathogen.

In some embodiments, sequence reads having an assigned taxonomyidentifier can be used to identify the one or more taxonomy identifierscorresponding to one or more candidate pathogens. For each of aplurality of taxonomy identifiers, a total score can be determined basedon at least coverage of a reference genome corresponding to the taxonomyidentifier. The total scores can be ranked, and one or more taxonomyidentifiers exceeding a threshold can be identified.

In one embodiment, updated alignment results can be provided to aclinician where the updated alignment results include a number of readsfor each of the plurality of taxonomy identifiers. As shown in FIGS. 7,9, and 10, the effect of taxonomic classification is to properlyclassify reads to their correct taxonomic level. Two reads (751 and 752in FIG. 7), one aligning to rhinovirus A 752, and the other aligning tohuman enterovirus K1105 751, are correctly bumped up to the genus level(genus Enterovirus) in FIG. 9. By doing the taxonomic classification,the interpretation of the data is that there is only one species ofenterovirus identified with a large number of sequence read alignments(enterovirus D) in the sample (yellow shading), and not two differentspecies (rhinovirus A and enterovirus D, for example). Table 1000 ofFIG. 10 shows a results table 1000 for classification and filtering.

In some embodiments, the classification of a read can begin oncealignment results for that read are obtained. Thus, the alignmentresults for all reads are not required before classification can begin.Accordingly, the two processes may be performed in parallel (e.g., inseparate threads) with the output of the alignment thread being used bythe classification thread, with the classification thread operatingslightly delayed as needing to wait for the alignment results for afirst read.

Once the candidate pathogen(s) are identified, additional availableclinical and laboratory data can be helpful in determining whether ornot the detected organism is pathogenic (i.e. causing disease) in thehost organisms (e.g. human patient). The detection of the presence of apotential pathogen in a clinical sample does not necessarily mean thatit is causing disease; the potential pathogen could be a colonizer, forinstance, or a bystander and have nothing to do with the host organism'sillness. If the candidate pathogen is deemed to be pathogenic byclinical and other criteria, the detection can be used to guide clinicalinterventions, which can include (1) drug therapy (e.g. prescribing oradministration of a targeted antimicrobial agent), (2) drugdiscontinuation (e.g. discontinuing a drug that was administeredempirically in the absence of a definitive diagnosis), (3) vaccination,if a vaccine is available and efficacious after infection (e.g. rabies),and (4) medical procedures (e.g. valve replacement in cases of fungalendocarditis, for which antifungal therapy alone is ineffective). Thefailure to detect a candidate pathogen may also be clinically useful to“rule out” infection as the cause of illness, which can guide cliniciansto work up and treat for non-infectious causes (e.g. administeringintravenous immunoglobulin and steroids for autoimmune disease).

Accordingly, when the sample of biological material is from a hostorganism, a clinical intervention can be performed for the host organismbased on the identification of the one or more taxonomy identifierscorresponding to the one or more candidate pathogens. The clinicalintervention can include the examples above. The intervention caninclude actually performing/administering any of the above procedures orprescribing them.

B. Minimum Alignment Threshold

The alignment results can provide a distance (e.g., an edit distance)between the sequence read and the matching region of a reference genome.The distance can provide a measure of how many nucleotide differencesare there between the actual sequence read and the reference. An editdistance can be the minimum number of editing operations (e.g.,insertion, deletion, and substitution) to transform the read into thereference genome, or vice versa. Using the edit distance can help toperform the classification quickly.

Once a distance is obtained, the reference genomes can be ranked by thedistance. The taxonomic classification can place the read at differentspecies, genus, or family level based on edit distances to differentmatching reference genomes. Thus, the taxonomic locations for each ofthe reference sequences can be determined.

The desired edit distance for determining matched alignments will varydepending on the sequence diversity of the organism. Thus, the minimumalignment threshold can vary depending on which reference genomes areinvolved. For instance, viruses have more divergent genomes (meaningthere is a lot of diversity). For viruses, a relative large thresholdcan be used, e.g., up to 12 mismatches in the fixed read length. Forbacteria and fungi, a smaller distance may be tolerated becausebacterial genomes that fit in the same species are highly identical; wewant to be able to not only identify the species but we want to bespecific.

For instance, if there are two different subtypes of enteroviruses, 70and 71, there can actually be up to 30% difference by sequence betweenthose two viruses. On the other hand, two bacterial species, such asstaph aureus and staph epidermidis, can be 99% identical across thegenome. Edit distances can be used as a criteria for classifying thesereads, and different taxonomic groups you would require different editdifferences. A suitable matching threshold distance can be determinedempirically meaning from clinical data or data generated from positiveand negative controls where it is known what is in the sample, e.g.,what pathogens are in the sample.

IV. Filtering Reference Genome Sequence Hits

One embodiment of the present invention is directed to filtering outsequences that appear to be false positives and/or are incorrectlyannotated and/or classified. For example, some reads may initiallyappear to align to a particular classified reference genome using afirst alignment algorithm (e.g., a global alignment algorithm like, forexample, SNAP or Needleman & Wunsch) but a more comprehensive alignment(e.g., using a local alignment algorithm like, for example, “Basic LocalAlignment Search Tool” (BLASTn) or Smith & Waterman) shows that even abest read can actually align to a different reference sequence usingdifferent alignment algorithms.

Furthermore, some reference genomes with a GenBank or other largedatabase of classified reference genomes may be mis-annotatedtaxonomically (i.e., are classified incorrectly). Additionally, somereference genomes may include portions of sequences that can be assignedto multiple taxonomies. If a sequence in the dataset aligns to the“erroneous” portion of that reference genome sequence, it will bemis-assigned by an alignment process. For example, an HIV viral sequencewith flanking human integration sites may be annotated as HIV and maycause erroneous matches and identification of potential pathogens. Assuch, embodiments are directed at filtering these “erroneous” referencegenomes using a filtering algorithm.

Accordingly, embodiments of the present invention can be applied after asequencing alignment analysis to help filter, categorize, and interpretthe results of the sequencing analysis to identify a smaller list ofpotential pathogens to a more manageable and easily interpretableresult. Thus, embodiments may be directed at filtering results once asystem has the raw reads, raw contigs, and assembled contigs aredetermined. Embodiments can take the reads and annotate them with whatthe hits were from the alignment analysis. For example, the system maytake the results of the analysis and determine how accurate and/or goodof a hit a result is. The system can take that annotated raw output andfurther filter it to remove mis-annotated results and false positivealigned reference genomes. Generated result tables showing thealignments and the coverage maps can be determined and displayed afterfiltering.

A. Filtering Method

FIG. 5 shows an exemplary filtering process 500 for removingfalse-positive and/or other erroneous reference genome sequences fromalignment results. Before the process shown in FIG. 5, the system mayreceive a plurality of sequence reads obtained from a sequencing of DNAmolecules from a sample of biological material that includes multipledifferent organisms. In some embodiments, the alignment results may havebeen previously taxonomically classified, e.g., using process 400.

At step 510, alignment results are obtained in response to applying aninitial alignment technique (e.g., using a global alignment algorithm)to a plurality of sequence reads from a sample to align the sequencereads to reference genomes in a global database. The alignment resultsinclude a matching reference genome for each of at least a portion ofthe sequence reads in the sample that the sequence read aligns. Forexample, the alignment technique may include a global alignmentalgorithm that takes each sequence read and searches a database ofclassified reference genomes for aligned portions of the sequence readsto portions of the classified reference genomes. The results may includemultiple aligned reference genomes for each of the reads, and eachreference genome sequence may have many different reads that align to aportion of the reference genome.

The alignment results may include a plurality of sequence reads,sequence read identifiers, a plurality of aligned classified referencegenomes, reference genome identifiers from the database of referencegenomes, an alignment and/or similarity measurement for each of thealignments, a taxonomy identifier and/or other classificationinformation associated with each of the reference genomes, and any othersuitable information that may be used in alignment and identificationprocesses. For example, the source and/or database identifier where thealigned reference genome was stored, the source where the referencegenome was provided from (e.g., hospital, clinic, company, provider,etc.), and the data and/or time of the upload may also be provided.

FIG. 6 shows an exemplary summary 620 of sequence alignment results forthree different samples. Column 610 shows read type. The first sample630 and second sample 640 are known to be entrovirus D68, and the thirdsample 650 is a control that aligns with hepatitis C. The three samplesshown in FIG. 6 include a result table that shows the number of initialsequence reads (e.g., for sample 1 there are 2,159,196 sequence reads),the number of sequence reads after preprocessing (e.g., for sample 1 thepreprocessing results in 1,933,046 sequence reads), the number ofsequence reads that align to human reference sequence genomes (e.g., forsample 1, 1,725,425 reads align to human reference sequence genomes),the number of reads that matched to a reference genome (e.g., for sample1, 133,932 reads aligned with a reference genome from the database ofreference genomes), and the number of bacteria and viruses matched(e.g., 133,827 and 21, respectively for sample 1). Further, the sequencestatistics shows the percentage of human matches (e.g., 89.3% forsample 1) and the percentage of reads that remained after preprocessing(e.g., 89.5% for sample 1).

Additionally, FIG. 7 shows an exemplary result table 700 for initialsequence reads. The result table includes the multiple reference genomesthat matched with at least one sequence read, a taxonomic identifier foreach reference genome including a species (710), genus (720), and family(730) classification for each reference genome. The species, genus, andfamily classifications are hierarchical with the species level being thelowest level, the genus level being above the species level, and thefamily level being above both the species and the genus level. Thetaxonomic identifier and the corresponding classification informationmay be provided by the classification information (e.g., an annotationentry) of an aligned reference genome from the database that was usedduring alignment.

Further, the results table may include tag (740) for each referencegenome that may identify the database that the reference genome wasaligned with and the type of reference genome (e.g., human, plant,bacteria, etc.). Finally, the result table may include the number ofreads that aligned to each reference genome for each sample. Forexample, for the first sample (column 750) shown in FIG. 7, there are133,004 reads aligned to Enterovirus D; 247 reads aligned to Enterovirussp., 574 sequence reads aligned to Human Enterovirus, 1 read aligned tohuman enterovirus K1105_171204, and 1 read aligned to human rhinovirussp. Accordingly, initial alignment results may include any alignmentmatches to any of the reference genomes that are present in thedatabase. The second sample (column 760) has a similar pattern as forthe first sample 750. The negative control (column 770) shows a matchfor hepatitis C, as is expected.

At step 520, the system identifies all reference genomes that matched toa read from the sample. For example, for the alignment results of sample#1 shown in FIG. 7, five different reference genomes were aligned withsequence reads from the sample.

At step 530, the system identifies an optimally-aligning sequence readthat aligns to the matching reference genome with an optimal alignmentscore that exceeds or is equal to alignments scores of other sequencereads that align to the matching reference genome. Theoptimally-aligning sequence read arid optimal alignment score can beidentified based on the initial alignment results. The optimally-alignedsequence read may be determined by analyzing the reads that aligned withthe reference genome and comparing their alignment value for the closestmatch. Any process may be used to identify the best match, for example,the read with the most significant match may be defined in SNAP byminimum edit distance and RAPSearch by minimum e-value.

At step 540, the system applies a second alignment technique for theoptimally aligning sequence read to the plurality of classifiedreference genomes to obtain a plurality of new alignment scores. Thesecond alignment technique may include a different alignment technique,a different reference database, and/or a combination thereof. Forexample, a local alignment algorithm may be applied for the secondalignment technique that may take longer to process for each sequenceread but because there are fewer sequence reads to process (e.g., onlythe identified optimally-aligning sequence reads from step 530), theprocessing time may not overly delay the identification process.

For example, the first alignment technique may include a globalalignment algorithm (e.g., SNAP/RAPSearch) that provides fasteranalysis, but may be sensitive to poor quality sequence regions. Thesecond alignment technique may include applying a local aligner (e.g.,BLASTn, Bowtie2, etc.) that may be much slower (1,000×-10000× slower),but that is less sensitive to poor reads and may provide a differentalignment of the sequence read that can be compared to the results ofthe first alignment technique. Thus, the use of the two differentalignment techniques allows the system to correct rare “false positives”that occur from the first alignment technique (e.g., SNAP alignment). Byonly selecting one “test hit” read per reference genome for the secondalignment technique (e.g., BLASTn confirmatory alignment), the systemcan greatly speed up the turnaround time of the process.

At step 550, the system compares the re-alignment results obtained usingthe second alignment technique for the optimally-aligned read with theinitial alignment results obtained using the first alignment technique.For example, different reference genomes may be aligned with thesequence read using the second alignment technique that were not alignedand/or different alignment scores may be returned with the re-alignmentthan the original alignment.

At step 560, the system determines whether the alignment results usingthe first alignment technique match the alignment results using thesecond alignment technique. The comparison of the first alignmentresults and the second alignment results to determine a match may beaccomplished through any suitable manner. For example, the system maydetermine whether any of the new alignment scores exceeds the optimalalignment score for the optimally aligning sequence read. In otherembodiments, the system may determine whether the same reference genomewas aligned with the sequence read. Further, other embodiments maydetermine whether the aligned reference genome sequences share the sametaxonomic identifier (e.g., the two reference genome sequences share thesame family, genus, and species classifications).

At step 570, if the alignment results for the sequence read aredetermined not to match, the system may remove the reference genomesequence from the results and all corresponding sequence reads thataligned with the reference genome sequence. In some embodiments, if asequence read aligned with multiple different reference genomes, onlythe association of that read with the currently analyzed referencegenome sequence may be removed and the read itself (along with thealignment result to the other reference genome sequence) may not beremoved. Thus, for the exemplary comparison method of comparing thealignment scores discussed above, when one new alignment score exceedsthe optimal alignment score for the optimally aligning sequence read,the system may remove the matching reference genome from the set ofmatching reference genomes and the corresponding sequence readsassociated with the reference genome. In some embodiments, if the tworeference genomes share a same taxonomic identifier, the matchingreference is not removed.

For example, FIGS. 7-8 shows a table of example results where filteringhas been applied to remove mis-annotated and false positive alignments.FIG. 7 shows the raw initial alignment results and FIG. 8 shows theresults 800 after filtering has been applied. As shown in FIG. 8, theeffect of filtering is to remove two mis-assigned reads that occur as aresult of misannotations in the database (761 and 771). These reads areannotated as “hepatitis B” and “simian virus 40” but actually are humanintegration sites that have been mis-annotated as viral. Using a localaligner (e.g., BLASTn), the system determined that these two reads areactually human reads (returning human reference genomes), and aretherefore removed from the output when the alignment results do notmatch with the initial alignment results of “hepatitis B” and “simianvirus 40”.

At step 580, if the alignment results for the sequence read aredetermined to match, the system may consider the reference genome a realmatch that is not mis-annotated and may maintain the reference genomeand aligned sequence reads in the alignment results.

At step 590, the system determines whether all of the reference genomeshave had the filtering process applied. If there are additionalreference genomes to be analyzed, the system may repeat the process forthe next aligned reference genome in the initial alignment results (andmay return to step 530). This process may continue until all of thealigned reference genomes have been analyzed for false-positive matchesand are removed and/or confirmed as matches.

At step 595, once the system has applied the filtering process to all ofthe aligned reference genomes, the process may provide the updatedalignment results (e.g., not including the selected reference genomesfrom step 570) to the sample alignment system for display to anoperator/clinician. In some embodiments, just the set of matchingreference genomes is output. In some embodiments, additional processingmay be applied to the updated alignment results before returning theupdated alignment results to the sample analysis computer, as describedherein.

Accordingly, embodiments may apply a filtering algorithm to the resultsthat identifies all the false positive hits, and systematically removefalse positive hits corresponding to that genome. The system canidentify the particular genomes that provided a sufficient qualityalignment. The system can then further analyze those reads to removefurther alignments that do not appear to be high quality. The filteringalgorithm can apply an even more stringent criteria to remove hits thatare likely not real and instead are due to a misannotation of areference genome database and/or a misclassification.

For example, a particular read might have hit two, five, or more hits.If a genome had a single hit then it is likely an accurate hit, but if aread hit ten different reference genomes, then it is not likely that thehit is of a high quality. Thus, the filtering could remove all thosehits and a new summary table and/or coverage map for the genome.Further, if the coverage map for a particular genome has now drasticallyreduced due to removing the false positive hits, then the system mayremove the genome as a potential pathogen or cause of the illness.Accordingly, embodiments of the present invention may filter what wasfifty results (e.g., fifty different bacteria) down to the five bacteriathat are the most high quality and likely real results. This allows formuch easier and faster clinical analysis and/or interpretation.

In some embodiments, methods 400 and 500 can be combined in a singlepipeline, e.g., in a research pipeline. In other embodiments, method 500may not be performed, e.g., a clinical pipeline might not include method500. For example, the taxonomic classification may be sufficientlyaccurate that filtering may not be needed. And, such a pipeline canoperate faster without the filtering step, particularly one that usesBLASTn. The turnaround time for a clinical pipeline may be desiredwithin 4-6 hours, whereas more time is acceptable for a researchpipeline. Thus, filtering, RAPSearch translated nucleotide alignment,and contig assembly may be retained in a research pipeline, but droppedfor a clinical pipeline.

B. Ribosomal RNA Sequence Removal

Additionally, in some embodiments, ribosomal RNA sequences may beremoved from the alignment results. Sequences derived from ribosomalregions are difficult to speciate/classify accurately due toconservation of the ribosomal sequences. Accordingly, embodiments mayremove ribosomal sequences from the alignment output by aligningsequence reads that had been assigned to bacteria against bacterialribosomal 16S/23S reference genome sequences. Additionally, the alignedsequence reads assigned to eukaryotes (fungi and parasites) may befurther aligned to 18S/28S GIs. Any sequence reads that align to thesebacterial or eukaryotic ribosomal reference genomes may be removed fromthe alignment results.

C. “On-the-Fly” Annotation of Database of Classified Reference Genomes:Automated Clean-Up of the Reference Database

Further, due to the rapidly expanding number of reference genomes storedin the classified reference genomic sequence database, such databasesmay double every 3 years or faster. However, there are very few controlsover who or the quality of those providing reference genomic sequences.Accordingly, embodiments can includes automated scripts to (1) downloadall sequences from the GenBank (currently ˜72 GB of data), (2) removeall entries that are poorly annotated by screening out key words in thedescription of a reference genome sequence entry (e.g., “uncultured”,“unclassified”, “environmental”), (3) remove all vector sequences, and(4) chop up the classified reference genome sequence database (e.g.,GenBank) into “chunks” and convert them into reference databases to beused by local and global alignment algorithms (e.g., SNAP andRAPSearch). A database may be chopped up so as to fit into memory.

V. Removal of Background

As mentioned above, a negative control can be used to identifycontaminants or other microorganisms that are not clinically relevant.Such remove of background microorganisms is particularly relevant toclinical applications, where diagnosis and treatment of a patient are agoal. Embodiments can provide one or more criteria for discriminatingbetween a pathogen and a contaminant.

In some embodiments, one or more control samples can be analyzed inparallel (simultaneously) with one or more patient samples (e.g., 5-8patient samples). A positive control can consist of several differentorganisms spiked into negative matrix, e.g., spinal cord fluid. Anegative control sample (also a called a no-template control) has nospiked organisms. The negative control can be just the buffer that isused for a PCR reaction, which is part of the library preparation forsequencing.

A. Types of Controls

The positive control can be used to confirm that the microorganisms inthe spiked contamination are identified. The positive control can ensurethat the procedure is robust. For instance, if the sample is like abloody fluid, the heme in blood may actually inhibit PCR. There can be anegative result, but it might be a false negative. Thus, if the positivecontrol is negative, then the sample can be identified as a falsenegative. The external positive control can ensure that there are acertain number of reads for each of the spiked organisms (e.g., 7different organisms), in order for the run to pass quality control.

With the no template control (e.g., a buffer for a PCR reaction or otherlibrary preparation step), any reads classified (i.e., hits) as aligningto a microorganism can be identified as corresponding to background. Asexamples, such false hits can correspond to: reagent contamination,contamination introduced in a laboratory, or contamination introducedfrom other samples (cross-contamination). In this manner, the number ofhits can be reduced, and proper treatment for an actual pathogen is morelikely to be performed.

An internal spiked control can include organisms spiked into a patientsample. For example, a DNA phage and/or an RNA phage can be spiked inspecific concentrations in a clinical sample. Embodiments can check tomake sure that the phages can be detected in the DNA library, e.g., asufficient number of sequences from the DNA phage. Similarly, a checkcan make sure that the phages are detected in the RNA library, e.g., asufficient number of sequences from the RNA phage. This control can bean internal control on every sample, in addition to external control.

B. Normalization

After analyzing the data from the sequencing of the samples (e.g.,aligning, classifying, filtering), a list of potential candidatepathogens can be obtained. The list can be a table showing a number ofreads seen for each of a plurality of species or other taxonomicclassification. The number of reads can be normalized. For instance, thenumber of reads can be determined per million reads. If there are 10hits and 10 million reads, the RPM (reads per million) will be one. Thetotal reads can be raw reads that were generated per clinical sample.But, there can be further normalization.

An RPM can also be determined in the negative control. An RPM ratio canbe determined from the reads per million (i.e., of a particulartaxonomic classification) in the sample divided by your reads permillion in the negative control. For example, assume there are 10normalized reads for a virus HIV) in the sample, and two normalizedreads in the no-template control. In that case, the RPM ratio would befive. If there are no reads in the negative control, then there would bea division by zero error. In such cases, a value of one can be used asthe RPM from the NTC sample. In other words, the RPM ratio can be thereads per million in the sample.

The RPM ratio allows for easier discrimination of what is really in thesample and what is background. A threshold (e.g., 10) can be used forthe discrimination. Thus, pathogens with an RPM ratio of greater than 10can be identified as clinically significant. Using this criteria, verygood sensitivity and specificity can be obtained. The use of the RPMratio reduces the number of false positive. In particular, very goodtest performance is obtained relative to using the Bayesian probability,which is based on just the number of reads in a sample. Suchprobabilistic method can be used when determining an error model, and donot involve normalization from reads in a negative control.

Tables 1-4 below shows revised accuracy data with and withoutdiscrepancy testing for results-based and sample-based testing.Discrepancy testing refers to the use of an orthogonal (i.e. differentkind of) test to resolve discrepant results between the mNGS assay andconventional clinical microbiology laboratory testing. For instance, ifa sample is found to be positive for Mycobacterium tuberculosis by mNGSbut is negative by culture (because some Mycobacterium tuberculosisstrains grow slowly or not at all in culture depending on pathogen titerand sample type), a Mycobacterium tuberculosis PCR test can be used asan orthogonal test for discrepancy testing, and the results of theorthogonal test are taken as correct for the purposes of determiningaccuracy.

For sample-based testing, for each sample, performance is evaluated onlyas it pertains to detection of the original organism reported by thereference lab. For samples completely negative by reference lab testing,the ability to detect all 5 organism types is evaluated.

For result-based testing, for each sample, performance is evaluated asit pertains to detection of all 5 organism types (bacteria, fungi, DNAvirus, RNA virus and parasites). Only results for acceptable samples(passed quality control metrics of >5,000,000 reads per library and >10RPM of either an internally spiked DNA control, T1 bacteriophage, forDNA libraries, or an internally spiked RNA control, M2 bacteriophage,for RNA libraries; sufficient sample volume) are provided in the tablesbelow.

Table 1 shows sample-based accuracy (most clinically significant resultper sample) without discrepancy testing.

TABLE 1 Sample-based accuracy (most clinically significant result persample) ClinMicro Reference Lab Total Pos Neg Untested Total mNGS Pos 597 7 73 Neg 25 90 115 Total 84 97 7 Sensitivity TP/(TP + FN) 70.2Specificity TN/(TN + FP) 92.8 Accuracy (TP + TN)/ 82.3 (TN + FN + FP +FN)

Table 2 shows results-based accuracy (includes multiple positive inegative results per sample) without discrepancy testing.

TABLE 2 Results-based accuracy (includes multiple positive/negativeresults per sample) ClinMicro Reference Lab Total Pos Neg Untested TotalmNGS Pos 60 7 31 98 Neg 27 404 431 Total 87 411 31 Sensitivity TP/(TP +FN) 69.0 Specificity TN/(TN + FP) 98.3 Accuracy (TP + TN)/ 93.2 (TN +FN + FP + FN)

Table 3 shows revised sample-based accuracy after discrepancy testing.

TABLE 3 Revised accuracy after discrepancy testing, sample-basedClinMicro Reference Lab Total Pos Neg Untested Total mNGS Pos 68 7 22 97Neg 11 327 338 Total 79 334 22 Sensitivity TPR TP/(TP + FN) 86.1Specificity TNR TN/(TN + FP) 97.9

Table 4 shows revised results-based accuracy after discrepancy testing.Only results for acceptable samples are provided.

TABLE 4 Revised accuracy after discrepancy testing, results-basedClinMicro Reference Lab Total Pos Neg Untested Total mNGS Pos 47 6 18 71Neg 8 303 311 Total 55 309 18 Sensitivity TPR TP/(TP + FN) 85.5Specificity TNR TN(TN + FP) 98.1

After discrepancy testing and including results passing IC withsufficient sample volume, overall sensitivity increases to 86.1%(sample-based) and 85.5% (results-based), while maintaining highspecificity. Given the difficulty making an etiologic diagnosis in manycases of encephalitis/meningitis, a test sensitivity of 80-90% withspecificity >95% is very useful in patient management decisions. Heabove data shows that the overall test accuracy is acceptable forclinical use.

The tables can be displayed in a visualization program. The tables canbe provided in a user interface, in a normalized or non-normalizedmanner. A normalized table is easier to identify pathogens than to lookat a list of hundreds of hits. For example, many hits can be removedafter normalization to obtain the RPM ratio, and use of a threshold.

C. No-Template Control Database

In addition to the use of the no-template control for a givenexperimental run, embodiments can use a database of no-template controls(NTC database). For every run, the database can be reviewed before acall is made about a pathogen being in the sample, e.g., to make surethat organism is not in the NTC database. The NTC database includeshistorical data about what pathogens were identified in NTC samples,e.g., within a specific amount of time, such as 30 days.

For example, a herpes virus might be identified in an NTC sample a monthago (e.g., due to contamination resulting from the same lab doing herpesvirus testing). Herpes virus 1 is a common cause of meningitis, and thuscan be a dangerous pathogen, if it actually existed. The NTC databasecan include herpes virus 1, until it is not seen in an NTC sample for atleast 30 days. In some embodiments, if the taxonomic classification isfound on the skin like poly papilloma viruses, they can be excludedoutright. But, they can still be included as part of the NTC database.

Very often, background includes skin flora. For instance, papillomaviruses may be seen in clinical samples in the no-template control. Ifwe do see such a virus in a no-template control, it is added to thedatabase. Embodiments may not actually report organisms that were foundin the no-template control unless they are present at a much higherlevel. For example, assume the RPM was 10 in the NTC for one run, thenthe taxonomic classification gets added to the database with an RPM of10. For the next run, if there was a herpes virus detected at 10 RPM(i.e., an RPM ratio of 1), it would not be reported when a threshold ofgreater than 10 for the RPM ration was used. It would have to be at anRPM of 100 to provide an RPM ration of 10 times the highest level thatwas present in the NTC database.

As it is not desirable to keep these pathogens in the database becausethat would affect the performance of the test, as the NTC databasedecreases the sensitivity, since the bar is set higher to make apositive call for the pathogens in the NTC database. Thus, the NTCdatabase is an evolving database. For example, if a pathogen has notbeen present for longer than three months (or other specified timeperiod), it then gets removed from the database. Steps can be performedto remove contaminants from NTC samples, e.g., replacing reagents. Theamount (e.g., RPM) in the NTC database for a particular contaminant canbe updated from the initial value to a new value if the new value isgreater than the initial value. Accordingly, in one embodiment, theamount in the NTC database for a contaminant can be the maximum that isobserved.

D. Flowchart for Use of NTC

FIG. 11 is a flowchart of a method 1100 for identifying pathogens in atest sample of biological material according to embodiments of thepresent invention. Method 1100 may be performed by a computer system.The test sample includes DNA molecules from a plurality of organisms,e.g., from a patient and from microorganisms.

At block 1110, a plurality of sequence reads are obtain from asequencing of DNA molecules from the test sample of biological material.The sequence reads can be received at the computer system in a varietyof ways, e.g., via a network or a removable storage device. The computersystem can also receive information about sequence reads from a negativecontrol sample. For example, the computer system can receive an amountof sequence reads for a plurality of difference reference genomes, e.g.,as classified into various taxonomic levels.

At block 1120, an alignment technique is used to align the plurality ofsequence reads to a plurality of classified reference genomes in adatabase. The reference genomes may be considered classified in that thegenomes are classified to at least one classification level in ataxonomy. Alignment results can include, for each of at least a portionof the sequence reads, at least one matching reference genome to whichthe sequence read aligns. The alignment results can includeclassification information for each of the matching reference genomes.

Blocks 1130-1160 may be performed for each of a group of one or morematching reference genomes.

At block 1130, a first amount of sequence reads from the test samplethat align to the matching genome is determined. As an example, thefirst amount may be a total number of sequence reads aligned to theparticular matching reference genome. The first amount may be normalizedbased on the number of sequence reads obtained from the test sample, orthe total number of sequence reads aligned to at least one referencegenome

At block 1140, a second amount of sequence reads from a negative controlsample that align to the matching genome is determined. In oneembodiment, the second amount of sequence reads can be determined from anegative control database. In another embodiment, the second amount ofsequence reads can be determined from a parallel sequencing of anegative control sample. Amounts from a parallel negative control sampleand the negative control sample can be combined to provide a singleamount.

At block 1150, a ratio of the first amount and the second amount isdetermined. The ratio can take various forms, such as the first mountdivided by the second amount, or the second amount divided by the firstamount. Further, a numerator or denominator can include a sum of the twoamounts.

At block 1160, the ratio is compared to a threshold to determine whetherthe ratio exceeds the threshold. A set of one or more matching referencegenomes can have the ratio exceed the threshold. The threshold can bedetermined based on empirical data (e.g., values of samples with a knowncomposition), as will be appreciated by one skilled in the art. Theselection of the threshold can be based on a desired balance betweensensitivity and specificity.

At block 1170, an output is provided that identifies the set of one ormore snatching reference genomes as potential pathogens in the testsample. As examples, the output can be a list of the set of matchingreference genomes, e.g., presented in classification levels. Such a listcould include more matching reference genomes, where the set isindicated, as may be done with a marker or the ratio. For instance, anRPM ratio can be provided.

VI. Further Data Analysis

Once the false-positives and inaccurate reads have been removed, thesystem may apply post-filtering analysis and provide the results to thesample analysis system for display the clinician. In some embodiments,the filtered and classified results may be used to generate new coveragemaps. The filtered and classified results may be analyzed to determine abest match and whether a sufficient match obtained. For example, a bestmatch score (or any one of the individual score) can be required to beabove a specified threshold. In some embodiments, if the coverage isbelow a certain level, than the system will dismiss the organism asbeing a possibility. In other embodiments, all possible identificationresults could be provided to allow a clinician to make anidentification.

Further, in some embodiments, a best match algorithm may be applied toselect, for instance, the best match among a variety of closely relatedmatches. For instance, the results may have hits to multiple differentstrains of influenza. The best match algorithm may choose the mostclosely matched sequence in the reference database and select thatclassified element as the best match for the sample sequence.

Examples of a best match algorithm are described herein, e.g., insection II.C. In another example, for each species, the reference genomesequence with the greatest number of reads assigned to it can be used asa reference for mapping of all of the species-specific reads. A“consensus sequence” may be generated from the mapping. The globalpairwise identity using the Needleman-Wunsch algorithm may thencalculated between the “consensus sequence” and each reference genomesequence. The reference genome sequence with the greatest globalpairwise identity may be selected as the top reference genome sequence.The reference genome sequences under consideration can also beprioritized as follows: (1) complete genomes, (2) complete sequences, or(3) partial sequences/individual genes.”

A positive pathogen identification may be reported where for each topreference genome a positive identification metric may be evaluated andprovided to the clinician in a return message to the sample analysiscomputer. For example, the positive identification metric may includecoverage of at least 3 genes/open reading frames (ORFs), or where thereare fewer than 3 ORFs in the virus, then coverage that exceeds greaterthan 3 times the read length (e.g., for 100 bp sequencing, coverage ofgreater than 300 bp).

Additionally, systems may implement a tagging mechanism for thereference genomes that are matched to the sample sequence reads in orderto make pathogen identification easier. For instance, the system may addthe “host” to viral reference genome sequences (GIs), where the “host”is bacteria, for instance, the system knows that the virus is a phage(which the system may want to mask out sequences to phages). Similarly,the system can apply arbitrary labels such as “pathogens”, “colonizers”,and “laboratory contaminants” to specific bacterial GIs that can behelpful in downstream visualization of results for clinicalinterpretation. These results may be provided with the final results ofthe alignment, filtering, and classification process.

VII. Example for Neurobrucellosis

A diagnosis of brucellosis can be difficult because routine culture andserological methods exhibit variable sensitivity and specificity. Atpresent, the standard laboratory diagnosis of brucellosis is based onisolation of the bacteria from clinical specimens and/or serologicaldetection of Brucella antibodies.

Neurobrucellosis is a known complication of systemic Brucella infection.It remains a difficult diagnosis to make and can mimic other fastidiousinfections such as TB [1]. Manifestations of neurobrucellosis are widelyvariable and include meningoencephalitis, cerebrovascular disease,peripheral and cranial neuropathies, or myelitis. Neuroimaging studiesof neurobrucellosis can also vary greatly among individuals [6]. Aprevious study from Turkey reported a high prevalence ofneurobrucellosis at 37% among 128 patients with brucellosis. Althoughculture is the gold standard for diagnosis, Brucella species arerelatively fastidious and slow-growing; cultures fail to recover theorganism 30-90% of the time, and also present a risk oflaboratory-acquired infection [7]. Serology is more sensitive fordetection, but can lead to false-positive results and may notdistinguish between active and prior infection.

Molecular methods based on detection of nucleic acid such as PCR [8] andnow potentially metagenomic NGS [13] can offer increased sensitivity andspecificity over conventional diagnostic testing. We present the use ofa metagenomic next-generation sequencing assay to diagnose a case ofneurobrucellosis from cerebrospinal fluid (CSF), resulting in theinstitution of appropriate antibiotic treatment and a favorable clinicaloutcome.

A. Case Report

Initially, microbiological testing was positive for Epstein-Barr virus(EBV) and human herpesvirus 7 (HHV-7) from CSF. Testing for otherpathogens, including Brucella by IgM antibody, was negative. The patientwas treated with parenteral acyclovir, followed by oral ganciclovir tocomplete a 14-day course. Two weeks after hospital discharge, shedeveloped back pain and worsening headache. Upon hospitalization, thepatient's vital signs were normal without fever. Physical examinationwas remarkable for multifocal myoclonus throughout her body. There washyperreflexia and decreased proprioceptive sensation of bilateral lowerlimbs. A complete blood count was notable for a white blood cell countof 4.82×10⁹/L (44% neutrophils, 44% lymphocytes, 10% monocytes).

A repeat lumbar puncture was performed, with additional microbiologicaltesting unrevealing. Cytologic examination of the CSF revealed nomalignant cells. Despite a negative tuberculin skin test andQuantiFERON-TB test, she was started on 4-drug therapy with isoniazid,rifampin, pyrazinamide, and ethambutol given a high concern for TBdisease based on the patient's suggestive CSF profile and risk factorsfor TB.

On day 8 of hospitalization, the result of CSF for Mycobacteriumtuberculosis (MTB) by PCR was negative. Given the lack of response toempiric TB therapy, there was a concern for drug-resistant TB.Ethambutol was thus changed to ethionamide, and levofloxacin was addedto the regimen. She improved substantially after changing her antibioticregimen, and was discharged home on 5 anti-TB medications. At follow up1 week and 1 month after discharge, her headache had resolved, but shecontinued to have fatigue, mild back pain, and intermittent episodes ofshaking of her extremities. When the result of the mycobacterial culturewas finalized as negative at 6 weeks, she returned to Mexico to continueher anti-TB therapy with INH and rifampin alone.

B. Results

RNA and DNA mNGS libraries from 600 μL of the patient's CSF sample wereconstructed. From a total of 23,638,587 raw reads in the DNA library,there were 277 (0.0012%) reads corresponding to the Brucella genus inthe DNA library, corresponding to an RPM ratio of 15.6, with allspecies-specific reads aligning to Brucella melitensis (see Tables 1200of FIG. 12, 1300 of FIG. 13, and 1 and 2 of the Appendix). Notably, noreads aligning to Brucella among 9,161,626 reads in the correspondingRNA library from CSF were detected, and Brucella reads were absent inboth negative “no template control” (NTC) and positive control samples.Other bacterial reads present in DNA library were also present in theNTC sample processed in parallel on the same run, and thus did not meetthe established threshold for reporting using an RPM ratio of 10 andwere attributed to laboratory reagent contamination.

Similarly, no viruses, fungi, or parasites met criteria for reporting.In contrast, all 7 microorganisms in the positive control were detectedat levels above the established reporting threshold. Importantly, noBrucella isolates or positive clinical samples from suspected orconfirmed cases had been present in the clinical laboratory prior to themNGS testing, nor had Brucella sequences ever been detected in the NTC.The presence of Brucella in the patient's cerebrospinal fluid wasconfirmed by Brucella-specific PCR testing and Sanger sequencing of theamplicon. The PCR reaction for 177 bp region of Brucella IS711 gene wasrun on a 2% agarose gel. Other samples included: a no-template-control;7 organism positive control mixture of CMV, HIV, Streptococcusagalactiae, Klebsiella pneumoniae, Cryptococcus neoformans; water; and 1kb Plus DNA ladder (Invitrogen).

After detection of Brucella reads in her CSF sample, the patient wascontacted and instructed to seek further medical evaluation. Shereturned to the hospital in Los Angeles for a second admission. Althoughshe had completed INH and rifampin therapy one week prior, she reportedpersistent back pain, nausea, and fatigue. Repeat MRI of the brain andspine was remarkable, and CSF was normal. A confirmatory serum Brucellaagglutinin titer was positive at 1:80. Because of persistent symptoms,mNGS and PCR testing showing Brucella, and positive confirmatoryserology, she was diagnosed with chronic neurobrucellosis and started ontargeted therapy with doxycycline and rifampin. Two weeks after startingtherapy, she reported that her symptoms had fully resolved. Notably,Brucella spp. was not isolated from either CSF or blood culture, nor wasBrucella DNA detected in CSF submitted for 165 rRNA gene sequencing.

This highlights the clinical impact of mNGS for diagnosis of infectionssuch as neurobrucellosis that are challenging to accurately diagnose andtreat. Although initially covered for TB meningitis using antibioticsthat partially treat neurobrucellosis (rifampin+/−levofloxacin), thepatient continued to be symptomatic and was not placed on targetedtherapy for Brucella until comprehensive metagenomic sequencing revealedthe presence of bacterial DNA in her CSF and neurobrucellosis wasconfirmed by Brucella agglutinin testing.

FIG. 12 shows table 1200 of sequencing statistics according toembodiments of the present invention. Results are shown from both thepatient's sample, no template control (NTC), and the positive batchcontrol (PC) sample.

FIG. 13 shows a table 1300 of the number of reads from DNA libraries andRNA libraries aligning to viral sequences according to embodiments ofthe present invention. Table 1300 shows the number of reads from DNAlibraries and RNA libraries aligning to viral sequences. Results areshown from both the patient's sample, NTC, and the PC sample.

In table 1300, the reported viruses are cytomegalovirus (CMV) in the“DNA PC” column and human immunodeficiency virus 1 (HIV-1) in the “RNAPC” column. These are the viruses spiked into the PC sample. No virusesare reported from the patient CSF because human papillomaviruses, partof skin flora and presumed to be a contaminant, are not reported by themNGS assay.

Supplementary Table 1 of the Appendix shows the number of reads from theDNA library aligning to bacterial sequences. Results are shown from boththe patient's sample, NTC, and the PC sample. In Supplementary Table 1,“*” indicates re-classified to a higher taxonomic rank because readsaligned equally well to multiple different organisms that shared thesame species, genus or family. The abbreviations are as follows: NTC, notemplate control; PC, positive control; and CSF, cerebrospinal fluid.

Supplementary Table 2 of the Appendix shows the number of reads from theDNA library aligning to fungal and parasite sequences. Results are shownfrom both the patient's sample, NTC, and the PC sample. In SupplementaryTable 2, “*” indicates re-classified to a higher taxonomic rank becausereads aligned equally well to multiple different organisms that sharedthe same species, genus or family.

Supplementary Table 3 of the Appendix shows bacterial taxa identified byembodiments using an RPM ratio metric for reads shown in SupplementaryTable 1. In Supplementary Table 3, “*” indicates re-classified to ahigher taxonomic rank because reads aligned equally well to multipledifferent organisms that shared the same species, genus or family; “#”indicates positive result according to an RPM (reads per million) ratio≥10, where the RPM ratio=RPM(sample)/RPM(NTC); and “&” indicates readscorrespond to the Streptococcus agalactiae PC. The abbreviations are asfollows: PC, positive control; CSF, cerebrospinal fluid; mNGS,metagenomic next-generation sequencing; and RPM, reads per million.

In Supplementary Table 3, cells that show a positive result by mNGSassay are highlighted in yellow. Under the column designated “DNA PC”,the two bacterial species reported at the pre-established thresholdcriterion of >10 RPM are Streptococcus agalactiae and Klebsiellapneumoniae, which were spiked into the PC sample. Other positive rowsare not reported because they represent either (1) classification at ahigher taxonomic rank (genus) or (2) another bacterial species in thesame genus (i.e. Streptococcus suis) that does not meet thepre-established requirement of having at least 1/10 of the number of RPMcorresponding to the predominant species (Streptococcus agalactiae) tocall a co-infection from at least 2 different species. Under the columndesignated “DNA Patient CSF”, the only bacterial taxa with >10 RPM isBrucella genus, which is the reported organism reported in the patient'sCSF.

Supplementary Table 4 of the Appendix shows fungal and parasitic taxaidentified by embodiments using an RPM ratio metric for reads shown inSupplementary Table 1. In Supplementary Table 4, “*” indicatesre-classified to a higher taxonomic rank because reads aligned equallywell to multiple different organisms that shared the same species, genusor family; and “#” indicates positive result according to an RPM ratio≥10, where the RPM ratio=RPM(sample)/RPM(NTC). The abbreviations are:PC, positive control; CSF, cerebrospinal fluid; mNGS, metagenomicnext-generation sequencing; RPM, reads per million.

In Supplementary Table 4, cells that show a positive result by mNGSassay are highlighted in yellow. Under the column “DNA PC”, the twofungal/parasitic species reported at the pre-established thresholdcriterion of >10 RPM are Cryptococcus neoformans, Aspergillus niger, andToxoplasma gondii, which were spiked into the PC sample. Other positiverows are not reported because they represent classification at a highertaxonomic rank (genus or family).

Supplementary Tables 3 and 4, which are used to interpret and reportmNGS results, show the effect of taxonomic classification andnormalization using an “RPM ratio” metric in comparison to the NTC insimplifying the clinical interpretation, as compared to SupplementaryTables 1 and 2, respectively.

Of particular relevance for this case, two chronic granulomatousinfections, tuberculosis and brucellosis, have not only overlappingclinical or radiographic features but also histologic characteristics incommon [10]. As such, misdiagnosis of tuberculosis in patients withbrucellosis has been reported in the literature [10]. This matter isfurther complicated by the fact that neither a negative CSFmycobacterial culture nor tuberculosis PCR-based assay excludes thediagnosis of TB meningitis if the clinical suspicion is high.Additionally, false-positive Brucella seroreactivity, ELISA andagglutination titer, in patients with active TB have also been reported.In this patient's case, the negative Brucella IgM but positive IgG wasincorrectly attributed to false-positive Brucella seroreactivity in thesetting of TB and not to active Brucella infection. It is possible thatthe patient may not have mounted a detectable IgM antibody response, orthat Brucella IgM levels had waned by the time of hospital admission.Confirmatory agglutinin testing may have been helpful in making thediagnosis of Brucella earlier.

This patient had temporary clinical improvement after initiation ofanti-TB medications. Rifampin and levofloxacin are two anti-TB agentsthat are also active against Brucella spp. [1]. However, rifampin, whileactive against Brucella, should always be used in combination with otheragents (e.g. doxycycline, trimethoprim-sulfamethoxazole, or quinolones)since monotherapy, as was inadvertently administered to this patientinitially, has been associated with high relapse rates [1]. Inhindsight, the patient's prolonged and indolent course was more likelyto be associated with Brucella, since more rapid clinical deteriorationwould have been expected for patients with active TB who areinadequately treated [11]. But, Brucella at that time was not high onthe differential. It is only after discharge and because of thepatient's persistent symptoms that an alternative diagnosis wasconsidered. Taken altogether, knowledge of these pitfalls is essentialfor clinicians to reduce diagnostic errors.

Metagenomic next-generation sequencing (mNGS) is an emerging approach indiagnostic microbiology with the ability to detect allmicroorganisms—viruses, bacteria, fungi, and parasites—in a single assay[2, 4, 9, 2, 3]. Here mNGS was used to provide an accurate diagnosis ofneurobrucellosis and to guide the institution of targeted therapy,leading to complete resolution of the patient's illness.

C. Implementations Details 1. Metagenomic Library Construction

DNA and RNA metagenomic libraries were constructed from the patient'sCSF sample as previously described [2, 3]. After bead-beating usingLysis matrix B (MP Biomedicals, Santa Ana Calif.) at 6 m/s for 30seconds, total nucleic acid was extracted using the Qiagen EZ1 Viral kit(Qiagen, Valencia Calif.). Half of the nucleic acid from CSF was treatedwith Turbo DNase (Ambion, Waltham Mass.), followed by reversetranscription of the RNA to cDNA using random hexamers and NGS librarypreparation using the Nextera XT DNA Library Prep Kit (Illumina, SanDiego Calif.). The remaining half was treated with the NEBNextMicrobiome DNA Enrichment Kit to enrich for microbial DNA (New EnglandBiosciences, Ipswich Mass.)), followed by Nextera XT librarypreparation. Dual-indexed, barcoded NGS libraries were quantitated onthe BioAnalyzer (Agilent, Santa Clara Calif.) and run on the IlluminaHiSeq (1.times.160 bp run).

The patient's CSF sample was processed and sequenced as part of arecently developed standardized operating procedure (SOP) for clinicalmNGS testing from patient samples in the University of California, SanFrancisco (UCSF) Clinical Microbiology Laboratory, a CLIA (ClinicalLaboratory Improvement Amendments)-licensed laboratory (Naccache, etal., manuscript in preparation). For each sequencing run, the SOPincludes running two external controls: (1) a negative “no-template”control (NTC) sample consisting of elution buffer, (2) and a positivecontrol sample consisting of a quantified mixture of 7 representativepathogens (CMV, HIV, Streptococcus agalactiae, Klebsiella pneumoniae,Cryptococcus neoformans, Aspergillus niger, and Toxoplasma gondii). Eachis spiked into negative CSF matrix at a concentration 1-2 log above theestimated limits of detection for that microorganism by probit analysis(Naccache, et al., manuscript in preparation).

2. Bioinformatics Analysis

Metagenomic NGS data was analyzed for pathogens using a modified versionof the SURPI (“sequence-based ultra-rapid pathogen identification”)computational pipeline, which identifies pathogen sequences on the basisof nucleotide alignments to National Center for Biotechnologyinformation (NCBI) nt reference database (March 2015 build) [4]. Aclinical version of the SURPI pipeline, named SURPI+, which employstaxonomic classification for more accurate read assignments andestablish normalized metrics and thresholds for clinical resultsreporting, was used for automated interpretation. Briefly, foridentification of viral sequences, both mNGS data from bath DNA and RNAsample libraries were used, whereas only mNGS data from DNA librariesalone were used for identification of sequences from bacteria, fungi,and parasites. An edit distance cutoff of 16, indicating the number ofsingle nucleotide insertions, deletions, or mismatches allowed betweenthe read and the reference sequence, was used for virus detection,whereas a more stringent edit distance cutoff of 1 was used farbacterial, fungal, and parasitic detection [2, 3]. Following alignment,a rapid taxonomic classification algorithm based on the lowest commonancestor algorithm was used to assign viral, bacterial, and non-chordateeukaryotic (fungal or parasitic) NGS reads to the species, genus, orfamily level, as previously described [2, 3].

3. Results Reporting

As part of clinical validation of mNGS testing for pathogen detection,we have established threshold cutoffs for automated reporting ofpositive results (Naccache, et al., manuscript in preparation). Briefly,for reporting of bacteria, fungi, and parasites, the cutoff is definedas an RPM (reads per million) ratio of ≥10, where the RPM ratio isdefined as the RPM_(sample)/RPM_(NTC) for any given taxon (species,genus, or family). If the taxon is not present in the NTC, then theRPM_(NTC) is 1. For reporting of viruses, the criteria can includecoverage of 2 non-contiguous/non-overlapping gene regions. Viruses withnon-vertebrate hosts, that are found in the NTC, or that constitutenormal body flora (e.g. anelloviruses) are not reported. A “gene region”can be defined as a non-overlapping and non-contiguous region of length<=read length (e.g., 141 bp).

4. Brucella PCR Confirmation

Confirmation of Brucella detection was performed by PCR using thefollowing primer set targeting the IS711 gene [5]:BrucellaGenus_F-2_GCCTTGGATCTGAGCCGTT (SEQ ID NO:1);BrucellaGenus_IS711_R GGCCTACCGCTGCGAAT (SEQ ID NO:2). The reaction wascarried out using the Qiagen One-Step RT-PCR Kit in a 25 μL totalreaction volume by addition of 4 μL Q solution, 4 μL. 5.times. buffer, 1μL dNTP, 1 μL enzyme, 1 μL of each primer at 12 pmol, and 2 μL oftemplate extracted DNA, and water. Cycling conditions were as follows:40 cycles of 94° C., 30 s/53° C., 30 s/72° C., 30. The recoveredsequence was confirmed to be Bruce by Sanger sequencing of the PCRamplicon.

VIII. Exemplary Computer System

FIG. 14 shows a block diagram of an example computer system 10 usablewith system and methods according to embodiments of the presentinvention. For example, computer system 10 can be used in implementingsample analysis system 110 and/or sequence identification computer 120.

Any of the computer systems mentioned herein may utilize any suitablenumber of subsystems. Examples of such subsystems are shown in FIG. 14in computer apparatus 10. In some embodiments, a computer systemincludes a single computer apparatus, where the subsystems can be thecomponents of the computer apparatus. In other embodiments, a computersystem can include multiple computer apparatuses, each being asubsystem, with internal components. A computer system can includedesktop and laptop computers, tablets, mobile phones and other mobiledevices.

The subsystems shown in FIG. 14 are interconnected via a system bus 75.Additional subsystems such as a printer 74, keyboard 78, storagedevice(s) 79, monitor 76, which is coupled to display adapter 82, andothers are shown. Peripherals and input/output (I/O) devices, whichcouple to I/O controller 71, can be connected to the computer system byany number of means known in the art such as input/output (I/O) port 77(e.g., USB, FireWire®). For example, I/O port 77 or external interface81 (e.g. Ethernet, Wi-Fi, etc.) can be used to connect computer system10 to a wide area network such as the Internet, a mouse input device, ora scanner. The interconnection via system bus 75 allows the centralprocessor 73 to communicate with each subsystem and to control theexecution of a plurality of instructions from system memory 72 or thestorage device(s) 79 (e.g., a fixed disk, such as a hard drive, oroptical disk), as well as the exchange of information betweensubsystems. The system memory 72 and/or the storage device(s) 79 mayembody a computer readable medium. Another subsystem is a datacollection device 85, such as a camera, microphone, accelerometer, andthe like. Any of the data mentioned herein can be output from onecomponent to another component and can be output to the user.

A computer system can include a plurality of the same components orsubsystems, e.g., connected together by external interface 81, by aninternal interface, or via removable storage devices that can beconnected and removed from one component to another component. In someembodiments, computer systems, subsystem, or apparatuses can communicateover a network. In such instances, one computer can be considered aclient and another computer a server, where each can be part of a samecomputer system. A client and a server can each include multiplesystems, subsystems, or components.

Aspects of embodiments can be implemented in the form of control logicusing hardware (e.g. an application specific integrated circuit or fieldprogrammable gate array) and/or using computer software with a generallyprogrammable processor in a modular or integrated manner. As usedherein, a processor includes a single-core processor, multi-coreprocessor on a same integrated chip, or multiple processing units on asingle circuit board or networked. Based on the disclosure and teachingsprovided herein, a person of ordinary skill in the art will know andappreciate other ways and/or methods to implement embodiments of thepresent invention using hardware and a combination of hardware andsoftware.

Any of the software components or functions described in thisapplication may be implemented as software code to be executed by aprocessor using any suitable computer language such as, for example,Java, C, C++. C#, Objective-C, Go, Swift, or scripting language such asPerl or Python using, for example, conventional or object-orientedtechniques. The software code may be stored as a series of instructionsor commands on a computer readable medium for storage and/ortransmission. A suitable non-transitory computer readable medium caninclude random access memory (RAM), a read only memory (ROM), a magneticmedium such as a hard-drive or a floppy disk, or an optical medium suchas a compact disk (CD) or DVD (digital versatile disk), flash memory,and the like. The computer readable medium may be any combination ofsuch storage or transmission devices.

Such programs may also be encoded and transmitted using carrier signalsadapted for transmission via wired, optical, and/or wireless networksconforming to a variety of protocols, including the Internetc. As such,a computer readable medium may be created using a data signal encodedwith such programs. Computer readable media encoded with the programcode may be packaged with a compatible device or provided separatelyfrom other devices (e.g., via Internet download). Any such computerreadable medium may reside on or within a single computer product (e.g.a hard drive, a CD, or an entire computer system), and may be present onor within different computer products within a system or network. Acomputer system may include a monitor, printer, or other suitabledisplay for providing any of the results mentioned herein to a user.

Any of the methods described herein may be totally or partiallyperformed with a computer system including one or more processors, whichcan be configured to perform the steps. Thus, embodiments can bedirected to computer systems configured to perform the steps of any ofthe methods described herein, potentially with different componentsperforming a respective steps or a respective group of steps. Althoughpresented as numbered steps, steps of methods herein can be performed ata same time or in a different order. Additionally, portions of thesesteps may be used with portions of other steps from other methods. Also,all or portions of a step may be optional. Additionally, any of thesteps of any of the methods can be performed with modules, units,circuits, or other means for performing these steps.

The specific details of particular embodiments may be combined in anysuitable manner without departing from the spirit and scope ofembodiments of the invention. However, other embodiments of theinvention may be directed to specific embodiments relating to eachindividual aspect, or specific combinations of these individual aspects.

The above description of example embodiments of the invention has beenpresented for the purposes of illustration and description. It is notintended to be exhaustive or to limit the invention to the precise formdescribed, and many modifications and variations are possible in lightof the teaching above.

A recitation of “a”, “an” or “the” is intended to mean “one or more”unless specifically indicated to the contrary. The use of “or” isintended to mean an “inclusive or,” and not an “exclusive or” unlessspecifically indicated to the contrary. Reference to a “first” componentdoes not necessarily require that a second component be provided.Moreover reference to a “first” or a “second” component does not limitthe referenced component to a particular location unless expresslystated.

All patents, patent applications, publications, and descriptionsmentioned herein are herein incorporated by reference in their entiretyfor all purposes. None is admitted to be prior art. Such referencesinclude:

-   1. Pappas G, Akritidis N, Bosilkovski M, Tsianos E. Brucellosis. The    New England journal of medicine 2005; 352(22): 2325-36.-   2. Greninger A L, Messacar K, Dunnebacke T, et al. Clinical    metagenomic identification of Balamuthia mandrillaris encephalitis    and assembly of the draft genome: the continuing case for reference    genome sequencing. Genome medicine 2015; 7(1): 113.-   3. Greninger A L, Naccache S N, Messacar K, et al. A novel outbreak    enterovirus D68 strain associated with acute flaccid myelitis cases    in the USA (2012-14): a retrospective cohort study. The Lancet    Infectious diseases 2015; 15(6): 671-82.-   4. Naccache S N, Federman S, Veeraraghavan N, et al. A    cloud-compatible bioinformatics pipeline for ultrarapid pathogen    identification from next-generation sequencing of clinical samples.    Genome research 2014; 24(7): 1180-92.-   5. Hinic V, Brodard I, Thomann A, et al. Novel identification and    differentiation of Brucella melitensis, B. abortus, B. suis, B.    ovis, B. canis, and B. neotomae suitable for both conventional and    real-time PCR systems. Journal of microbiological methods 2008;    75(2): 375--   6. Al-Sous M W, Bolilega S, Al-Kawi M Z, Alwatban J, McLean D R.    Neurobrucellosis: clinical and neuroimaging correlation. AJNR    American journal of neuroradiology 2004; 25(3): 395-401.-   7. Moyer N P H L, Murray P R, Baron E J, Pfaller M A, Tenover F C,    Youlken R H. Brucella. Manual of Clinical Microbiology. ASM Press,    1995:549-55.-   8. Yu W L, Nielsen K. Review of detection of Brucella spp. by    polymerase chain reaction. Croatian medical journal 2010; 51(4):    306-13.-   9. Chiu C Y, Miller S. Next-Generation Sequencing. In: Persing D H,    Tenover F C, Hayden R T, Ieven G, Miller M B, Nolte F S. Molecular    Microbiology, Diagnostic Principles and Practice, 3rd Edition.    Washington, D.C.: ASM Press, 2016:68-79-   10. Dasari S, Naha K, Prabhu M. Brucellosis and tuberculosis:    clinical overlap and pitfalls. Asian Pacific journal of tropical    medicine 2013; 6(10): 823-5.-   11. Verdon R, Chevret S, Laissy J P, Wolff M. Tuberculous meningitis    in adults: review of 48 cases. Clinical infectious diseases: an    official publication of the Infectious Diseases Society of America    1996; 22(6): 982-8.-   12. Naccache S N, Peggs K S, Mattes F M, et al. Diagnosis of    neuroinvasive astrovirus infection in an immunocompromised adult    with encephalitis by unbiased next-generation sequencing. Clinical    infectious diseases: an official publication of the Infectious    Diseases Society of America 2015; 60(6): 919-23.-   13. Wilson M R, Naccache S N, Samayoa E, et al. Actionable diagnosis    of neuroleptospirosis by next-generation sequencing. The New England    journal of medicine 2014; 370(25): 2408-17.

APPENDIX

Supplementary Table 1 # of Reads DNA DNA DNA Patient species genusfamily NTC PC CSF * Brucella Brucellaceae 0 0 251 Brucella melitensisBrucella Brucellaceae 0 0 26 Streptococcus agalactiae StreptococcusStreptococcaceae 9 50,894 0 Klebsiella pneumoniae KlebsiellaEnterobacteriaceae 426 39,013 4 * * Enterobacteriaceae 246,369 21,3422,350 Escherichia coli Escherichia Enterobacteriaceae 170,966 11,9591,601 Propionibacterium acnes Propionibacterium Propionibacteriaceae126,533 5,329 1,158 * Streptococcus Streptococcaceae 236 3,556 2 *Klebsiella Enterobacteriaceae 49 3,542 0 * * * 41,826 3,194 457Streptococcus suis Streptococcus Streptococcaceae 20 561 0 Streptococcussalivarius Streptococcus Streptococcaceae 188 333 13 Streptococcuspyogenes Streptococcus Streptococcaceae 0 299 0 Serratia marcescensSerratia Enterobacteriaceae 59 233 0 Lactococcus lactis LactococcusStreptococcaceae 338 223 2 * Escherichia Enterobacteriaceae 3,865 210 35Pseudomonas sp. TKP Pseudomonas Pseudomonadaceae 941 205 69 *Pseudomonas Pseudomonadaceae 499 185 11 Streptococcus StreptococcusStreptococcaceae 0 177 0 macedonicus Enterobacter cloacae EnterobacterEnterobacteriaceae 235 160 1 Klebsiella oxytoca KlebsiellaEnterobacteriaceae 102 129 4 Thermoanaerobacterium ThermoanaerobacteriumThermoanaerobacterales 268 107 2 thermosaccharolyticum Family III.Incertae Sedis Streptococcus sp. VT 162 Streptococcus Streptococcaceae90 100 0 Enterococcus faecium Enterococcus Enterococcaceae 526 96 85Pseudomonas protegens Pseudomonas Pseudomonadaceae 17 87 3 Micrococcusluteus Micrococcus Micrococcaceae 6,424 78 3 Streptococcus infantariusStreptococcus Streptococcaceae 94 75 0 Staphylococcus StaphylococcusStaphylococcaceae 8,168 74 13 epidermidis Enterobacter asburiaeEnterobacter Enterobacteriaceae 250 65 0 Cupriavidus metalliduransCupriavidus Burkholderiaceae 934 61 0 Pseudomonas putida PseudomonasPseudomonadaceae 300 57 1 Enterococcus Enterococcus Enterococcaceae 37154 79 casseliflavus Streptococcus mitis Streptococcus Streptococcaceae55 53 0 Streptococcus equi Streptococcus Streptococcaceae 0 51 0Salmonella enterica Salmonella Enterobacteriaceae 1 48 0 Streptococcusoralis Streptococcus Streptococcaceae 51 47 0 Klebsiella variicolaKlebsiella Enterobacteriaceae 0 42 0 Burkholderia lata BurkholderiaBurkholderiaceae 85 40 3 Pseudomonas stutzeri PseudomonasPseudomonadaceae 1,094 40 32 Streptococcus StreptococcusStreptococcaceae 63 39 3 pneumoniae Streptococcus StreptococcusStreptococcaceae 3 36 0 dysgalactiae * Burkholderia Burkholderiaceae 17135 18 Pseudomonas fluorescens Pseudomonas Pseudomonadaceae 50 35 4Acinetobacter guillouiae Acinetobacter Moraxellaceae 207 35 4Veillonella parvula Veillonella Veillonellaceae 151 33 7 Xanthomonascampestris Xanthomonas Xanthomonadaceae 164 32 24 Exiguobacterium sp.Exiguobacterium 271 32 0 AT1b Pseudomonas Pseudomonas Pseudomonadaceae932 32 0 pseudoalcaligenes Streptococcus Streptococcus Streptococcaceae0 27 0 pasteurianus Rothia dentocariosa Rothia Micrococcaceae 270 252 * * Streptococcaceae 0 25 0 Delftia acidovorans Delftia Comamonadaceae28 25 0 * Propionibacterium Propionibacteriaceae 773 23 8 Acinetobacterbaumannii Acinetobacter Moraxellaceae 853 23 5 StreptococcusStreptococcus Streptococcaceae 393 23 2 parasanguinis Pseudomonas sp.Pseudomonas Pseudomonadaceae 15 22 1 WCS374 Enterobacter aerogenesEnterobacter Enterobacteriaceae 0 22 0 Stenotrophomonas StenotrophomonasXanthomonadaceae 78 22 0 maltophilia Alicyclobacillus AlicyclobacillusAlicyclobacillaceae 79 21 2 acidocaldarius Haemophilus influenzaeHaemophilus Pasteurellaceae 145 19 13 Rothia mucilaginosa RothiaMicrococcaceae 379 19 3 Staphylococcus xylosus StaphylococcusStaphylococcaceae 68 18 3 Acidovorax sp. JS42 Acidovorax Comamonadaceae92 18 3 * Rahnella Enterobacteriaceae 0 18 0 Streptococcus StreptococcusStreptococcaceae 30 18 0 pseudopneumoniae Streptococcus StreptococcusStreptococcaceae 114 17 4 thermophilus Pseudomonas aeruginosaPseudomonas Pseudomonadaceae 114 17 1 Corynebacterium CorynebacteriumCorynebacteriaceae 0 16 0 kroppenstedtii Staphylococcus StaphylococcusStaphylococcaceae 1,320 16 0 haemolyticus Serratia sp. SCBI SerratiaEnterobacteriaceae 0 15 0 * Enterococcus Enterococcaceae 123 14 5Staphylococcus warneri Staphylococcus Staphylococcaceae 326 14 3Burkholderia cepacia Burkholderia Burkholderiaceae 63 13 9 *Corynebacterium Corynebacteriaceae 167 13 7 Comamonas testosteroniComamonas Comamonadaceae 175 13 1 Bifidobacterium BifidobacteriumBifidobacteriaceae 0 13 0 thermophilum * Lactobacillus Lactobacillaceae50 13 0 Thermoanaerobacterium ThermoanaerobacteriumThermoanaerobacterales 56 13 0 xylanolyticum Family III. Incertae SedisRalstonia pickettii Ralstonia Burkholderiaceae 147 13 0 Meiothermusruber Meiothermus Thermaceae 540 13 0 Acidovorax ebreus AcidovoraxComamonadaceae 10 12 0 Micrococcus sp. V7 Micrococcus Micrococcaceae 3612 0 Leuconostoc Leuconostoc Leuconostocaceae 335 11 1 mesenteroidesBacillus halodurans Bacillus Bacillaceae 9 11 0 CorynebacteriumCorynebacterium Corynebacteriaceae 20 11 0 variabile * AcidovoraxComamonadaceae 48 11 0 Bifidobacterium bifidum BifidobacteriumBifidobacteriaceae 619 11 0 Rhizobium sp. IRBG74 Rhizobium Rhizobiaceae182 10 4 * Acinetobacter Moraxellaceae 569 10 1 AcinetobacterAcinetobacter Moraxellaceae 0 10 0 calcoaceticus ChroococcidiopsisChroococcidiopsis 28 10 0 thermalis Streptococcus sanguinisStreptococcus Streptococcaceae 47 10 0 Acidovorax sp. KKS102 AcidovoraxComamonadaceae 74 10 0 Raoultella ornithinolytica RaoultellaEnterobacteriaceae 94 10 0 Ochrobactrum anthropi OchrobactrumBrucellaceae 270 10 0 Lactobacillus johnsonii LactobacillusLactobacillaceae 9 9 2 Methylobacterium populi MethylobacteriumMethylobacteriaceae 30 9 3 Rhodococcus equi Rhodococcus Nocardiaceae 629 4 Lactobacillus helveticus Lactobacillus Lactobacillaceae 4 9 0 *Serratia Enterobacteriaceae 30 9 0 Burkholderia cenocepacia BurkholderiaBurkholderiaceae 15 8 1 * Staphylococcus Staphylococcaceae 1,001 8 6Enterobacter sp. R4-368 Enterobacter Enterobacteriaceae 0 8 0Propionibacterium Propionibacterium Propionibacteriaceae 31 8 0propionicum Streptococcus gordonii Streptococcus Streptococcaceae 35 8 0Corynebacterium Corynebacterium Corynebacteriaceae 48 8 0 singulareBurkholderia ambifaria Burkholderia Burkholderiaceae 0 7 1 * MicrococcusMicrococcaceae 142 7 2 Fervidobacterium Fervidobacterium Thermotogaceae83 7 1 nodosum Aeromonas media Aeromonas Aeromonadaceae 433 7 3Cronobacter sakazakii Cronobacter Enterobacteriaceae 0 7 0 Myroidesprofundi Myroides Flavobacteriaceae 0 7 0 Methylobacterium oryzaeMethylobacterium Methylobacteriaceae 0 7 0 * XanthomonasXanthomonadaceae 13 7 0 Thermoanaerobacterium ThermoanaerobacteriumThermoanaerobacterales 73 7 0 saccharolyticum Family III. Incertae SedisPseudomonas mendocina Pseudomonas Pseudomonadaceae 84 7 0Corynebacterium Corynebacterium Corynebacteriaceae 315 7 0ureicelerivorans Lactobacillus crispatus Lactobacillus Lactobacillaceae74 6 3 Alicycliphilus Alicycliphilus Comamonadaceae 132 6 2denitrificans Gardnerella vaginalis Gardnerella Bifidobacteriaceae 1,1406 2 * Gemella 0 6 0 * Ralstonia Burkholderiaceae 0 6 0 Eggerthella lentaEggerthella Coriobacteriaceae 3 6 0 * * Rhizobiaceae 7 6 0 Prevotelladenticola Prevotella Prevotellaceae 16 6 0 Prevotella intermediaPrevotella Prevotellaceae 18 6 0 Psychrobacter sp. PRwf-1 PsychrobacterMoraxellaceae 31 6 0 Azospira oryzae Azospira Rhodocyclaceae 38 6 0Acinetobacter Acinetobacter Moraxellaceae 47 6 0 haemolyticus * DelftiaComamonadaceae 54 6 0 Burkholderia contaminans BurkholderiaBurkholderiaceae 7 5 4 Arthrobacter arilaitensis ArthrobacterMicrococcaceae 20 5 7 Dermacoccus Dermacoccus Dermacoccaceae 31 5 6nishinomiyaensis Pantoea ananatis Pantoea Enterobacteriaceae 40 5 5Staphylococcus Staphylococcus Staphylococcaceae 148 5 8 saprophyticusStaphylococcus pasteuri Staphylococcus Staphylococcaceae 69 5 1 Rahnellaaquatilis Rahnella Enterobacteriaceae 0 5 0 Rahnella sp. Y9602 RahnellaEnterobacteriaceae 0 5 0 Campylobacter concisus CampylobacterCampylobacteraceae 0 5 0 Geobacillus sp. WCH70 Geobacillus Bacillaceae 05 0 * Frankia Frankiaceae 0 5 0 Lactobacillus casei LactobacillusLactobacillaceae 0 5 0 Thiomonas intermedia Thiomonas 0 5 0Streptococcus gallolyticus Streptococcus Streptococcaceae 5 5 0Thioalkalivibrio Thioalkalivibrio Ectothiorhodospiraceae 10 5 0sulfidiphilus * Bradyrhizobium Bradyrhizobiaceae 13 5 0 Bifidobacteriumlongum Bifidobacterium Bifidobacteriaceae 27 5 0 Corynebacteriumfalsenii Corynebacterium Corynebacteriaceae 32 5 0 Delftia sp. Cs1-4Delftia Comamonadaceae 40 5 0 Acinetobacter sp. M131 AcinetobacterMoraxellaceae 79 5 0 Prevotella Prevotella Prevotellaceae 151 5 0melaninogenica * * Comamonadaceae 217 5 0 Leuconostoc camosumLeuconostoc Leuconostocaceae 13 4 1 Pectobacterium PectobacteriumEnterobacteriaceae 74 4 3 carotovorum * Myroides Flavobacteriaceae 0 40 * Erwinia Enterobacteriaceae 0 4 0 Gordonia sp. KTR9 GordoniaGordoniaceae 0 4 0 Paenibacillus sp. FSL R7- PaenibacillusPaenibacillaceae 0 4 0 0273 Paracoccus sp. N81106 ParacoccusRhodobacteraceae 0 4 0 Sphingobium fuliginis SphingobiumSphingomonadaceae 0 4 0 * * 1 4 0 * Geobacillus Bacillaceae 10 4 0Pseudomonas sp. Pseudomonas Pseudomonadaceae 11 4 0 VLB 120Pelagibacterium Pelagibacterium Hyphomicrobiaceae 22 4 0 halotoleransStreptococcus intermedius Streptococcus Streptococcaceae 34 4 0Propionibacterium Propionibacterium Propionibactcriaceae 45 4 0freudenreichii * Enterobacter Enterobacteriaceae 72 4 0 Nakamurellamultipartita Nakamurella Nakamurellaceae 32 3 7 Haemophilus parasuisHaemophilus Pasteurellaceae 0 3 0 Fusobacterium nucleatum FusobacteriumFusobacteriaceae 0 3 0 Citrobacter freundii CitrobacterEnterobacteriaceae 1 3 0 Ruminococcus sp. SR1/5 RuminococcusRuminococcaceae 9 3 0 Pseudoxanthomonas PseudoxanthomonasXanthomonadaceae 10 3 0 spadix Lactococcus garvieae LactococcusStreptococcaceae 15 3 0 Neisseria elongala Neisseria Neisseriaceae 18 30 Acidovorax citrulli Acidovorax Comamonadaceae 19 3 0 NovosphingobiumNovosphingobium Sphingomonadaceae 20 3 0 pentaromativorans Citrobacterkoseri Citrobacter Enterobacteriaceae 26 3 0 MethylobacteriumMethylobacterium Methylobacteriaceae 32 3 0 aquaticum PseudomonasPseudomonas Pseudomonadaceae 34 3 0 denitrificans Rhodococcuserythropolis Rhodococcus Nocardiaceae 36 3 0 Lactobacillus reuteriLactobacillus Lactobacillaceae 48 3 0 Bacteroides fragilis BacteroidesBacteroidaceae 70 3 0 Lactobacillus plantarum LactobacillusLactobacillaceae 137 3 0 * Bacillus Bacillaceae 196 3 0 PseudomonasPseudomonas Pseudomonadaceae 0 2 1 rhizosphaerae AchromobacterAchromobacter Alcaligenaceae 41 2 1 xylosoxidans Lactobacillusamylovorus Lactobacillus Lactobacillaceae 0 2 0 PropionibacteriumPropionibacterium Propionibacteriaceae 0 2 0 acidipropionici Leuconostocgelidum Leuconostoc Leuconostocaceae 0 2 0 Weissella thailandensisWeissella Leuconostocaceae 0 2 0 Pandoraea sp. RB-44 PandoraeaBurkholderiaceae 0 2 0 Escherichia vulneris EscherichiaEnterobacteriaceae 0 2 0 Yersinia intermedia Yersinia Enterobacteriaceae0 2 0 Flavobacteriaceae Flavobacteriaceae 0 2 0 bacterium 3519-10 *Rhodococcus Nocardiaceae 0 2 0 Streptococcus sp. (N1) StreptococcusStreptococcaceae 0 2 0 * Methylobacterium Methylobacteriaceae 8 2 0Sphingomonas sp. MM-1 Sphingomonas Sphingomonadaceae 9 2 0 Rhizobiumetli Rhizobium Rhizobiaceae 18 2 0 * Agrobacterium Rhizobiaceae 41 2 0Thermus scotoductus Thermus Thermaceae 53 2 0 MethylobacteriumMethylobacterium Methylobacteriaceae 60 2 0 extorquens Streptococcus sp.I-P16 Streptococcus Streptococcaceae 63 2 0 Pantoea sp. PSNIH1 PantoeaEnterobacteriaceae 76 2 0 Methylobacterium MethylobacteriumMethylobacteriaceae 96 2 0 radiotolerans [Ruminococcus] torques BlautiaLachnospiraceae 0 1 7 Lactobacillus sakei Lactobacillus Lactobacillaceae0 1 2 * * Rhodocyclaceae 0 1 1 Bacillus licheniformis BacillusBacillaceae 34 1 5 Corynebacterium Corynebacterium Corynebacteriaceae 01 0 accolens Corynebacterium sp. Corynebacterium Corynebacteriaceae 0 10 ATCC 6931 Serratia symbiotica Serratia Enterobacteriaceae 0 1 0Lactobacillus delbrueckii Lactobacillus Lactobacillaceae 0 1 0 Bacillussp. YP1 Bacillus Bacillaceae 0 1 0 Klebsiella sp. PG122E KlebsiellaEnterobacteriaceae 0 1 0 * Rathayibacter Microbacteriaceae 0 1 0Pseudoalteromonas sp. Pseudoalteromonas Pseudoalteromonadaceae 0 1 0 P30Staphylococcus sp. Staphylococcus Staphylococcaceae 0 1 0 CDC25Corynebacterium Corynebacterium Corynebacteriaceae 1 1 0 resistensShigella dysenteriae Shigella Enterobacteriaceae 9 1 0 * *Xanthomonadaceae 11 1 0 Agrobacterium fabrum Agrobacterium Rhizobiaceae19 1 0 Gordonia Gordonia Gordoniaceae 22 1 0 polyisoprenivoransPseudomonas balearica Pseudomonas Pseudomonadaceae 25 1 0 * *Pseudomonadaceae 25 1 0 Ruminococcus bromii Ruminococcus Ruminococcaceae29 1 0 Brachybacterium faecium Brachybacterium Dermabacteraceae 31 1 0Acinetobacter johnsonii Acinetobacter Moraxellaceae 37 1 0 Micrococcussp. A1 Micrococcus Micrococcaceae 95 1 0 Filifactor alocis FilifactorPeptostreptococcaceae 132 1 0 Pantoea vagans Pantoea Enterobacteriaceae162 1 0 Haemophilus Haemophilus Pasteurellaceae 241 1 0 parainfluenzaePantoea rwandensis Pantoea Enterobacteriaceae 0 0 5 CorynebacteriumCorynebacterium Corynebacteriaceae 0 0 4 vitaeruminis Pseudomonas poaePseudomonas Pseudomonadaceae 8 0 3 * * Brucellaceae 0 0 2 Lactobacillusfermentum Lactobacillus Lactobacillaceae 0 0 2 Anabaena variabilisAnabaena Nostocaceae 0 0 2 Sphingobacterium sp. SphingobacteriumSphingobacteriaceae 0 0 2 ML3W sugarcane isolate 74-1 0 0 2 * *Geodermatophilaceae 1 0 2 Megasphaera elsdenii MegasphaeraVeillonellaceae 0 0 1 Pseudoxanthomonas PseudoxanthomonasXanthomonadaceae 7 0 1 suwonensis Corynebacterium CorynebacteriumCorynebacteriaceae 157 0 17 glutamicum Sphingomonas taxi SphingomonasSphingomonadaceae 19 0 2 Pseudomonas graminis PseudomonasPseudomonadaceae 12 0 1 Bradyrhizobium sp. BradyrhizobiumBradyrhizobiaceae 21 0 1 BTAi1 Enterococcus hirae EnterococcusEnterococcaceae 25 0 1 Corynebacterium sp. L2- CorynebacteriumCorynebacteriaceae 34 0 1 79-05 Arthrobacter Arthrobacter Micrococcaceae36 0 1 phenanthrenivorans Corynebacterium mans CorynebacteriumCorynebacteriaceae 87 0 1 Gordonia bronchialis Gordonia Gordoniaceae 900 1 Kytococcus sedentarius Kytococcus Dermacoccaceae 130 0 1 Kosakoniacowanii Kosakonia Enterobacteriaceae 1 0 0 Xenorhabdus bovieniiXenorhabdus Enterobacteriaceae 1 0 0 Paracoccus haeundaensis ParacoccusRhodobacteraceae 1 0 0 Methylobacterium sp. 238 MethylobacteriumMethylobacteriaceae 1 0 0 Acinetobacter sp. BW3 AcinetobacterMoraxellaceae 1 0 0 Aeromonas sobria Aeromonas Aeromonadaceae 1 0 0Bacillus lehensis Bacillus Bacillaceae 1 0 0 Ralstonia solanacearumRalstonia Burkholderiaceae 1 0 0 Citrobacter sp. FPO3 CitrobacterEnterobacteriaceae 1 0 0 Citrobacter sp. I91-3 CitrobacterEnterobacteriaceae 1 0 0 Erwinia amylovora Erwinia Enterobacteriaceae 10 0 Klebsiella milletis Klebsiella Enterobacteriaceae 1 0 0 Salmonellabongori Salmonella Enterobacteriaceae 1 0 0 Serratia grimesii SerratiaEnterobacteriaceae 1 0 0 Yersinia pestis Yersinia Enterobacteriaceae 1 00 * Enterobacteriaceae 1 0 0 Lactobacillus brevis LactobacillusLactobacillaceae 1 0 0 Kocuria sp. starX Kocuria Micrococcaceae 1 0 0Acinetobacter sp. 26 Acinetobacter Moraxellaceae 1 0 0 Peptoclostridiumdifficile Peptoclostridium Peptostreptococcaceae 1 0 0 Sorangiumcellulosum Sorangium Polyangiaceae 1 0 0 Pseudomonas sp. NSi14Pseudomonas Pseudomonadaceae 1 0 0 Synergistetes oral clone 1 0 0 03 5D05 bacterium EBAD26 1 0 0 bacterium NLAE-zl-G351 1 0 0 rumen bacterium1 0 0 enrichment culture clone Y74 unidentified marine 2 0 0bacterioplankton Escherichia albertii Escherichia Enterobacteriaceae 2 00 Mobiluncus curtisii Mobiluncus Actinomycetaceae 2 0 0 * CaulobacterCaulobacteraceae 2 0 0 Methylotenera versalilis MethyloteneraMethlophilaceae 2 0 0 Propionibacterium sp. PropionibacteriumPropionibacteriaceae 3 0 0 KPL1849 bacterium EBAD25 3 0 0 Bacillus sp.Pc3 Bacillus Bacillaceae 3 0 0 Acinetobacter sp. EVA14 AcinetobacterMoraxellaceae 3 0 0 Agrobacterium sp. Agrobacterium Rhizobiaceae 3 0 0Alistipes shahii Alistipes Rikenellaceae 3 0 0 * Thermus Thermaceae 3 00 * Methylibium 3 0 0 butyrate-producing 3 0 0 bacterium SSC/2Escherichia fergusonii Escherichia Enterobacteriaceae 4 0 0 Enterobactersp. Ni15 Enterobacter Enterobacteriaceae 4 0 0 Capnocytophaga ochraceaCapnocytophaga Flavobacteriaceae 4 0 0 Thauera sp. 6NLG ThaueraRhodocyclaceae 4 0 0 Desulfovibrio alaskensis DesulfovibrioDesulfovibrionaceae 4 0 0 Variovorax sp. Alb14 Variovorax Comamonadaceae4 0 0 * Shigella Enterobacteriaceae 4 0 0 * * Micromonosporaceae 4 0 0 *Micromonospora Micromonosporaceae 4 0 0 Thermobifida fusca ThermobifidaNocardiopsaceae 4 0 0 Turneriella parva Turneriella Leptospiraceae 5 0 0[Clostridium] sticklandii Peptoclostridium Peptostreptococcaceae 5 0 0Acinetobacter sp. Ooi24 Acinetobacter Moraxellaceae 5 0 0 Ochrobactrumsp. SJY1 Ochrobactrum Brucellaceae 5 0 0 Carnobacterium sp.Carnobacterium Carnobacteriaceae 5 0 0 WN1359 Iamia majanohamensis IamiaIamiaceae 5 0 0 Saccharomonospora Saccharomonospora Pseudonocardiaceae 50 0 viridis Rhizobium sp. Rhizobium Rhizobiaceae 5 0 0 Staphylococcussp. CDC3 Staphylococcus Staphylococcaceae 5 0 0 Shigella sonnei ShigellaEnterobacteriaceae 6 0 0 Pseudomonas syringae PseudomonasPseudomonadaceae 6 0 0 Burkholderia Burkholderia Burkholderiaceae 6 0 0vietnamiensis Shigella boydii Shigella Enterobacteriaceae 6 0 0 BacillusBacillus Bacillaceae 6 0 0 weihenstephanensis Erythrobacter litoralisErythrobacter Erythrobacteraceae 6 0 0 PseudoalteromonasPseudoalteromonas Pseudoalteromonadaceae 6 0 0 haloplanktis Pseudomonassp. FGI182 Pseudomonas Pseudomonadaceae 6 0 0 * Rhizobium Rhizobiaceae 60 0 * Rickettsia Rickettsiaceae 6 0 0 Sphingobium yanoikuyae SphingobiumSphingomonadaceae 6 0 0 Stenotrophomonas StenotrophomonasXanthomonadaceae 6 0 0 rhizophila * Leuconostoc Leuconostocaceae 7 0 0Aquincola tertiaricarbonis Aquincola 7 0 0 Nocardiopsis dassonvilleiNocardiopsis Nocardiopsaceae 7 0 0 Carnobacterium CarnobacteriumCarnobacteriaceae 7 0 0 maltaromaticum * Haemophilus Pasteurellaceae 7 00 Bordetella parapertussis Bordetella Alcaligenaceae 7 0 0 * DietziaDietziaceae 7 0 0 Shewanella sp. W3-18-1 Shewanella Shewanellaceae 7 0 0Sphingomonas sp. NP5 Sphingomonas Sphingomonadaceae 7 0 0 StaphylococcusStaphylococcus Staphylococcaceae 7 0 0 gallinarum Micavibrio Micavibrio7 0 0 aeruginosavorus Paracoccus denitrificans ParacoccusRhodobacteraceae 8 0 0 [Cellvibrio] gilvus CellulomonasCellulomonadaceae 8 0 0 Corynebacterium CorynebacteriumCorynebacteriaceae 8 0 0 jeikeium * * Staphylococcaceae 8 0 0Meiothermus silvanus Meiothermus Thermaceae 8 0 0 Asticcacaulisexcentricus Asticcacaulis Caulobacteraceae 8 0 0 * AtopobiumCoriobacteriaceae 8 0 0 Streptococcus constellatus StreptococcusStreptococcaceae 8 0 0 Microcystis aeruginosa Microcystis 8 0 0agricultural soil bacterium 8 0 0 SC-I-13 [Ruminococcus] obeum BlautiaLachnospiraceae 9 0 0 Thermus thermophilus Thermus Thermaceae 9 0 0Shigella flexneri Shigella Enterobacteriaceae 9 0 0 * MycobacteriumMycobacteriaceae 9 0 0 Pseudomonas savastanoi PseudomonasPseudomonadaceae 9 0 0 Staphylococcus capitis StaphylococcusStaphylococcaceae 9 0 0 * Cupriavidus Burkholderiaceae 9 0 0 Dyadobacterfermentans Dyadobacter Cytophagaceae 9 0 0 Dietzia sp. CQ4 DietziaDietziaceae 9 0 0 * * Methylophilaceae 9 0 0 * Neisseria Neisseriaceae 90 0 Corynebacterium Corynebacterium Corynebacteriaceae 10 0 0aurimucosum * 10 0 0 Pseudomonas fulva Pseudomonas Pseudomonadaceae 10 00 Chromohalobacter Chromohalobacter Halomonadaceae 10 0 0 salexigensBrevundimonas diminuta Brevundimonas Caulobacteraceae 10 0 0Streptococcus lutetiensis Streptococcus Streptococcaceae 10 0 0Bordetella petrii Bordetella Alcaligenaceae 10 0 0 Erythrobacter sp.JP13.1 Erythrobacter Erythrobacteraceae 10 0 0 MethylobacillusMethylobacillus Methylophilaceae 10 0 0 glycogenes Candidatus RhodolunaCandidatus Rhodoluna Microbacteriaceae 10 0 0 lacicola Arthrobacter sp.JBH1 Arthrobacter Micrococcaceae 10 0 0 Aggregatibacter AggregatibacterPasteurellaceae 10 0 0 aphrophilus Thauera sp. B4 Thauera Rhodocyclaceae10 0 0 Lysobacter dokdonensis Lysobacter Xanthomonadaceae 10 0 0Clostridiales genomosp. 10 0 0 BVAB3 Streptococcus sp. I-G2Streptococcus Streptococcaceae 11 0 0 Pseudomonas mandelii PseudomonasPseudomonadaceae 11 0 0 Bradyrhizobium sp. BradyrhizobiumBradyrhizobiaceae 11 0 0 S23321 Phenylobacterium PhenylobacteriumCaulobacteraceae 11 0 0 zucineum Pseudomonas mosselii PseudomonasPseudomonadaceae 11 0 0 Staphylococcus Staphylococcus Staphylococcaceae11 0 0 lugdunensis Proteus mirabilis Proteus Enterobacteriaceae 11 00 * * Neisseriaceae 11 0 0 Arthrobacter sp. J3-40 ArthrobacterMicrococcaceae 11 0 0 * Pantoea Enterobacteriaceae 12 0 0Corynebacterium Corynebacterium Corynebacteriaceae 12 0 0 efficiens *Halomonas Halomonadaceae 12 0 0 Trueperella pyogenes TrueperellaActinomycetaceae 12 0 0 Streptomyces coelicolor StreptomycesStreptomycetaceae 12 0 0 Kocuria rhizophila Kocuria Micrococcaceae 13 00 Bacillus cereus Bacillus Bacillaceae 13 0 0 Tannerella forsythiaTannerella Porphyromonadaceae 13 0 0 * Alkalibacterium Carnobacteriaceae13 0 0 Atopobium parvulum Atopobium Coriobacteriaceae 13 0 0Serinicoccus profundi Serinicoccus Intrasporangiaceae 13 0 0 *Leptotrichia Leptotrichiaceae 13 0 0 * * Planococcaceae 13 0 0Planomicrobium Planomicrobium Planococcaceae 13 0 0 okeanokoitesJannaschia sp. CCS1 Jannaschia Rhodobacteraceae 13 0 0 Paracoccusaestuarii Paracoccus Rhodobacteraceae 13 0 0 Rhodobacter blasticusRhodobacter Rhodobacteraceae 13 0 0 Agrobacterium AgrobacteriumRhizobiaceae 14 0 0 tumefaciens Shewanella sp. ANA-3 ShewanellaShewanellaceae 14 0 0 Pseudomonas cichorii Pseudomonas Pseudomonadaceae14 0 0 Halomonas sp. A3H3 Halomonas Halomonadaceae 14 0 0 Serratialiquefaciens Serratia Enterobacteriaceae 14 0 0 Sphingopyxis alaskensisSphingopyxis Sphingomonadaceae 14 0 0 * Brevundimonas Caulobacteraceae14 0 0 Deinococcus deserti Deinococcus Deinococcaceae 14 0 0Desulfovibrio vulgaris Desulfovibrio Desulfovibrionaceae 14 0 0Propionibacterium sp. Propionibacterium Propionibacteriaceae 14 0 0NTS31307302 Paracoccus alcaliphilus Paracoccus Rhodobacteraceae 14 0 0Vibrio parahaemolyticus Vibrio Vibrionaceae 14 0 0 Candidatus 14 0 0Saccharibacteria oral taxon TM7x Geitlerinema sp. PCC Geitlerinema 15 00 7407 * Actinomyces Actinomycetaceae 15 0 0 Brevundimonas BrevundimonasCaulobacteraceae 15 0 0 vesicularis Acinetobacter sp. YS0810Acinetobacter Moraxellaceae 15 0 0 * Prevotella Prevotellaceae 15 0 0Methyloceanibacter Methyloceanibacter 15 0 0 caenitepidi Leuconostoccitreum Leuconostoc Leuconostocaceae 16 0 0 * Bacteroides Bacteroidaceae16 0 0 Pseudomonas alcaligenes Pseudomonas Pseudomonadaceae 16 0 0Methylibium Methylibium 16 0 0 petroleiphilum Moraxella catarrhalisMoraxella Moraxellaceae 16 0 0 Sphingopyxis sp. Kp5.2 SphingopyxisSphingomonadaceae 16 0 0 Pandoraea apista Pandoraea Burkholderiaceae 160 0 * Cellulomonas Cellulomonadaceae 16 0 0 Olsenella uli OlsenellaCoriobacteriaceae 16 0 0 Acinetobacter oleivorans AcinetobacterMoraxellaceae 16 0 0 Sphingomonas sp. 133 Sphingomonas Sphingomonadaceae16 0 0 Meiothermus taiwanensis Meiothermus Thermaceae 16 0 0 Deinococcusgeothermalis Deinococcus Deinococcaceae 17 0 0 LactobacillusLactobacillus Lactobacillaceae 17 0 0 sanfranciscensis Acinetobacter sp.Acinetobacter Moraxellaceae 17 0 0 LUH5605 Staphylococcus simulansStaphylococcus Staphylococcaceae 17 0 0 Arsenophonus nasoniaeArsenophonus Enterobacteriaceae 17 0 0 Buchnera aphidicola BuchneraEnterobacteriaceae 17 0 0 Weissella koreensis Weissella Leuconostocaceae17 0 0 Psychrobacter sp. G Psychrobacter Moraxellaceae 17 0 0Amycolicicoccus Amycolicicoccus Mycobacteriaceae 17 0 0 subflavusStaphylococcus hyicus Staphylococcus Staphylococcaceae 17 0 0 Morganellamorganii Morganella Enterobacteriaceae 18 0 0 Kyrpidia tusciae KyrpidiaAlicyclobacillaceae 18 0 0 Ramlibacter tataouinensis RamlibacterComamonadaceae 18 0 0 Weeksella virosa Weeksella Flavobacteriaceae 18 00 Acinetobacter junii Acinetobacter Moraxellaceae 18 0 0 Acinetobactersp. 26B2 Acinetobacter Moraxellaceae 18 0 0 Mycobacterium abscessusMycobacterium Mycobacteriaceae 18 0 0 Neisseria gonorrhoeae NeisseriaNeisseriaceae 18 0 0 Sphingomonas wittichii SphingomonasSphingomonadaceae 19 0 0 Bacteroides dorei Bacteroides Bacteroidaceae 190 0 Corynebacterium Corynebacterium Corynebacteriaceae 19 0 0halotolerans Pelomonas aquatica Pelomonas Comamonadaceae 19 0 0Janibacter sp. TYM3221 Janibacter Intrasporangiaceae 19 0 0 *Arthrobacter Micrococcaceae 19 0 0 Mycobacterium gordonae MycobacteriumMycobacteriaceae 19 0 0 Pimelobacter simplex PimelobacterNocardioidaceae 19 0 0 Pseudomonas sp. Pseudomonas Pseudomonadaceae 19 00 OM2164 Streptomyces albus Streptomyces Streptomycetaceae 20 0 0Halomonas halocynthiae Halomonas Halomonadaceae 20 0 0 Nitrosomonas sp.AL212 Nitrosomonas Nitrosomonadaceae 20 0 0 SphingobacteriumSphingobacterium Sphingobacteriaceae 20 0 0 mizutaii Vibrio choleraeVibrio Vibrionaceae 20 0 0 Rickettsia felis Rickettsia Rickettsiaceae 210 0 * * Moraxellaceae 21 0 0 Bacteroidetes bacterium 21 0 0 WD317Corynebacterium casei Corynebacterium Corynebacteriaceae 22 0 0Corynebacterium Corynebacterium Corynebacteriaceae 22 0 0 marinum * *Flavobacteriaceae 22 0 0 Caulobacter segnis Caulobacter Caulobacteraceae22 0 0 Lactobacillus gasseri Lactobacillus Lactobacillaceae 22 0 0 *Meiothermus Thermaceae 22 0 0 Rhizobium sp. NT-26 Rhizobium Rhizobiaceae23 0 0 Bacillus coagulans Bacillus Bacillaceae 23 0 0 * SphingomonasSphingomonadaceae 23 0 0 * Brevibacterium Brevibacteriaceae 23 0 0Nitrosomonas curopaca Nitrosomonas Nitrosomonadaceae 23 0 0 PseudomonasPseudomonas Pseudomonadaceae 24 0 0 alkylphenolia Terrabacter sp. DBF63Terrabacter Intrasporangiaceae 24 0 0 beta proteobacterium CB 24 0 0Moraxella ovis Moraxella Moraxellaceae 24 0 0 Shewanella balticaShewanella Shewanellaceae 24 0 0 Mycobacterium gilvum MycobacteriumMycobacteriaceae 25 0 0 * Exiguobacterium 25 0 0 * OchrobactrumBrucellaceae 25 0 0 Geodermatophilus GeodermatophilusGeodermatophilaceae 25 0 0 obscurus * Devosia Hyphomicrobiaceae 25 0 0Moraxella osloensis Moraxella Moraxellaceae 25 0 0 Exiguobacterium sp.11- Exiguobacterium 25 0 0 28 Nocardioides sp. JS614 NocardioidesNocardioidaceae 26 0 0 Nocardioides sp. USM2 NocardioidesNocardioidaceae 26 0 0 Burkholderia gladioli BurkholderiaBurkholderiaceae 27 0 0 Renibacterium Renibacterium Micrococcaceae 27 00 salmoninarum Pseudomonas syringae Pseudomonas Pseudomonadaceae 27 0 0group genomosp. 3 Bifidobacterium Bifidobacterium Bifidobacteriaceae 270 0 pseudolongum toluene-degrading 27 0 0 bacterium UCR 021tCorynebacterium imitans Corynebacterium Corynebacteriaceae 28 0 0Corynebacterium callunae Corynebacterium Corynebacteriaceae 28 0 0 Boseasp. WAO Bosea Bradyrhizobiaceae 28 0 0 Xanthobacter XanthobacterXanthobacteraceae 29 0 0 autotrophicus Corynebacterium CorynebacteriumCorynebacteriaceae 29 0 0 diphtheriae Bacteroides BacteroidesBacteroidaceae 29 0 0 thetaiotaomicron Caulobacter vibrioidesCaulobacter Caulobacteraceae 29 0 0 * * Pasteurellaceae 30 0 0Finegoldia magna Finegoldia Peptoniphilaceae 30 0 0 Anaerococcusprevotii Anaerococcus Peptoniphilaceae 30 0 0 Azorhizobium AzorhizobiumXanthobacteraceae 30 0 0 caulinodans Erysipelothrix ErysipelothrixErysipelotrichaceae 31 0 0 rhusiopathiae Porphyromonas PorphyromonasPorphyromonadaceae 31 0 0 asaccharolytica * SphingopyxisSphingomonadaceae 31 0 0 Eubacterium rectale Eubacterium Eubacteriaceae32 0 0 Acinetobacter venetianus Acinetobacter Moraxellaceae 33 0 0Variovorax paradoxus Variovorax Comamonadaceae 33 0 0 Acinetobacter sp.ED45- Acinetobacter Moraxellaceae 33 0 0 25 BradyrhizobiumBradyrhizobium Bradyrhizobiaceae 34 0 0 diazoefficiens BradyrhizobiumBradyrhizobium Bradyrhizobiaceae 35 0 0 japonicum Megamonas hypermegaleMegamonas Veillonellaceae 35 0 0 * Methylobacillus Methylophilaceae 35 00 Nocardiopsis alba Nocardiopsis Nocardiopsaceae 35 0 0 Modestobactermarinus Modestobacter Geodermatophilaceae 36 0 0 CorynebacteriumCorynebacterium Corynebacteriaceae 37 0 0 doosanense Blastococcussaxobsidens Blastococcus Geodermatophilaceae 38 0 0 AnoxybacillusAnoxybacillus Bacillaceae 38 0 0 flavithermus Aeromonas caviae AeromonasAeromonadaceae 40 0 0 Bacillus subtilis Bacillus Bacillaceae 40 0 0Elizabethkingia anophelis Elizabethkingia Flavobacteriaceae 40 0 0Staphylococcus hominis Staphylococcus Staphylococcaceae 44 0 0Ruminococcus bicirculans Ruminococcus Ruminococcaceae 45 0 0 Paracoccusmarcusii Paracoccus Rhodobacteraceae 45 0 0 * PsychrobacterMoraxellaceae 46 0 0 Sphingomonas Sphingomonas Sphingomonadaceae 47 0 0sanxanigenens * * Leuconostocaceae 47 0 0 * * Burkholderiaceae 49 0 0Bacteroides vulgatus Bacteroides Bacteroidaceae 50 0 0 RhodopseudomonasRhodopseudomonas Bradyrhizobiaceae 51 0 0 palustris Pantoea agglomeransPantoea Enterobacteriaceae 52 0 0 * * Sphingomonadaceae 53 0 0Mycobacterium kansasii Mycobacterium Mycobacteriaceae 53 0 0 *Streptomyces Streptomycetaceae 55 0 0 Enterococcus faecalis EnterococcusEnterococcaceae 55 0 0 Acinetobacter sp. NFM2 AcinetobacterMoraxellaceae 57 0 0 Shewanella putrefaciens Shewanella Shewanellaceae62 0 0 Bifidobacterium Bifidobacterium Bifidobacteriaceae 63 0 0adolescentis * Bifidobacterium Bifidobacteriaceae 66 0 0 Porphyromonasgingivalis Porphyromonas Porphyromonadaceae 68 0 0 Neisseriameningitidis Neisseria Neisseriaceae 69 0 0 Rhodococcus RhodococcusNocardiaceae 70 0 0 pyridinivorans Aeromonas salmonicida AeromonasAeromonadaceae 73 0 0 Planococcus sp. PAMC Planococcus Planococcaceae 750 0 21323 Pseudomonas simiae Pseudomonas Pseudomonadaceae 81 0 0Faecalibacterium Faecalibacterium Ruminococcaceae 95 0 0 prausnitziiAcinetobacter lwoffii Acinetobacter Moraxellaceae 142 0 0Exiguobacterium sp. Exiguobacterium 146 0 0 N139 Streptococcus anginosusStreptococcus Streptococcaceae 156 0 0 Thauera sp. MZ1T ThaueraRhodocyclaceae 159 0 0 * Shewanella Shewanellaceae 226 0 0 * AeromonasAeromonadaceae 332 0 0 Staphylococcus aureus StaphylococcusStaphylococcaceae 407 0 0 Aeromonas hydrophila Aeromonas Aeromonadaceae537 0 0 Aeromonas veronii Aeromonas Aeromonadaceae 563 0 0Bifidobacterium breve Bifidobacterium Bifidobacteriaceae 892 0 0Bacillus megaterium Bacillus Bacillaceae 2,131 0 0

Supplementary Table 2 Readcount DNA Patient species genus family DNA NTCDNA PC CSF Cryptococcus neoformans Filobasidiella Tremellaceae 0 104,1350 Aspergillus niger Aspergillus Aspergillaceae 8 21,725 0 Toxoplasmagondii Toxoplasma Sarcocystidae 0 9,228 0 * Filobasidiella Tremellaceae0 519 0 * * Sarcocystidae 0 266 0 Neospora caninum NeosporaSarcocystidae 0 50 0 Hammondia triffittae Hammondia Sarcocystidae 0 240 * * Aspergillaceae 0 20 0 * Hammondia Sarcocystidae 0 20 0 *Aspergillus Aspergillaceae 50 173 0 Hammondia hammondi HammondiaSarcocystidae 7 20 0 Aspergillus fumigatus Aspergillus Aspergillaceae 015 0 Aspergillus awamori Aspergillus Aspergillaceae 0 7 0 Bartheletiaparadoxa Bartheletia Bartheletiaceae 0 7 0 Cryptococcus gattiiFilobasidiella Tremellaceae 0 7 0 Aspergillus kawachii AspergillusAspergillaceae 0 6 0 Spirometra erinaceieuropaei SpirometraDiphyllobothriidae 3 5 3 Mucor racemosus Mucor Mucoraceae 0 5 0Aspergillus oryzae Aspergillus Aspergillaceae 0 4 0 Setosphaeria turcicaSetosphaeria Pleosporaceae 0 4 0 Caenorhabditis remanei CaenorhabditisRhabditidae 0 3 0 Aspergillus tubingensis Aspergillus Aspergillaceae 0 30 Anisakis simplex Anisakis Anisakidae 7 2 0 Penicillium solitumPenicillium Aspergillaceae 0 1 0 Plectosphaerella sp. 93 OA-2013Plectosphaerella Plectosphaerellaceae 0 1 0 * * Tremellaceae 0 1 0Malassezia globosa Malassezia Malasseziaceae 231 35 2 Gongylonemapulchrum Gongylonema Gongylonematidae 27 4 0 Wallemia sebi Wallemia 6810 1 * * * 2,821 394 17 Candida parapsilosis Candida Debaryomycetaceae323 45 6 Elaeophora elaphi Elaeophora Onchocercidae 77 10 1 Lichtheimiahongkongensis Lichtheimia Lichtheimiaceae 70 6 0 Brugia timori BrugiaOnchocercidae 117 3 0 Parastrongyloides trichosuri ParastrongyloidesStrongyloididae 293 5 1 Sordaria macrospora Sordaria Sordariaceae 7 0 1Bipolaris sorokiniana Bipolaris Pleosporaceae 34 0 3 Malasseziarestricta Malassezia Malasseziaceae 1 0 0 Cladosporium oxysporumCladosporium Cladosporiaceae 1 0 0 Strongylocentrotus purpuratusStrongylocentrotus Strongylocentrotidae 5 0 0 Caenorhabditis elegansCaenorhabditis Rhabditidae 5 0 0 Penicillium griseoroseum PenicilliumAspergillaceae 7 0 0 Fusarium graminearum Fusarium Nectriaceae 7 0 0Leptosphaeria biglobosa Leptosphaeria Leptosphaeriaceae 8 0 0 Chaetomiumglobosum Chaetomium Chaetomiaceae 8 0 0 Myceliophthora thermophilaMyceliophthora Chaetomiaceae 8 0 0 Alternaria tenuissima AlternariaPleosporaceae 8 0 0 * Saccharomyces Saccharomycetaceae 8 0 0 Aphanomyceseuteiches Aphanomyces Saprolegniaceae 8 0 0 Ophiostoma piliferumOphiostoma Ophiostomataceae 9 0 0 Rhodotorula taiwanensis Rhodotorula 90 0 Trichosporon domesticum Trichosporon 10 0 0 Phoma herbarum PhomaDidymellaceae 12 0 0 Erysiphe alphitoides Erysiphe Erysiphaceae 13 0 0 *Leptosphaeria Leptosphaeriaceae 15 0 0 Zymoseptoria tritici ZymoseptoriaMycosphaerellaceae 15 0 0 Debaryomyces hansenii DebaryomycesDebaryomycetaceae 15 0 0 Alternaria alternata Alternaria Pleosporaceae16 0 0 * Phoma Didymellaceae 17 0 0 * Umbilicaria Umbilicariaceae 18 00 * Penicillium Aspergillaceae 20 0 0 Wuchereria bancrofti WuchereriaOnchocercidae 20 0 0 Cladosporium cladosporioides CladosporiumCladosporiaceae 20 0 0 Saccharomyces bayanus SaccharomycesSaccharomycetaceae 20 0 0 Fusarium phaseoli Fusarium Nectriaceae 21 0 0Dasybranchus sp. DH1 Dasybranchus Capitellidae 23 0 0 Neofusicoccumparvum Neofusicoccum Botryosphaeriaceae 30 0 0 * CladosporiumCladosporiaceae 30 0 0 Penicillium rubens Penicillium Aspergillaceae 330 0 Ustilago maydis Ustilago Ustilaginaceae 38 0 0 Mucor circinelloidesMucor Mucoraceae 68 0 0 Albugo laibachii Albugo Albuginaceae 115 0 0

Supplementary Table 3 Normalized Reads Per million DNA Patient speciesgenus family DNA PC CSF * Brucella Brucellaceae 0.0 15.7^(#) Brucellamelitensis Brucella Brucellaceae 0.0 1.6 Streptococcus agalactiaeStreptococcus Streptococcaceae 3,264.1* 0.0 Klebsiella pneumoniaeKlebsiella Enterobacteriaceae 52.9* 0.0 * * Enterobacteriaceae 0.1 0.0Escherichia coli Escherichia Enterobacteriaceae 0.0 0.0Propionibacterium acnes Propionibacterium Propionibacteriaceae 0.0 0.0 *Streptococcus Streptococcaceae 8.7 0.0 * Klebsiella Enterobacteriaceae41.6^(#) 0.0 * * * 0.0 0.0 Streptococcus suis StreptococcusStreptococcaceae 16.2^(#&) 0.0 Streptococcus salivarius StreptococcusStreptococcaceae 1.0 0.0 Streptococcus pyogenes StreptococcusStreptococcaceae 30.7^(#&) 0.0 Serratia marcescens SerratiaEnterobacteriaceae 2.3 0.0 Lactococcus lactis LactococcusStreptococcaceae 0.4 0.0 * Escherichia Enterobacteriaceae 0.0 0.0Pseudomonas sp. TKP Pseudomonas Pseudomonadaceae 0.1 0.0 * PseudomonasPseudomonadaceae 0.2 0.0 Streptococcus macedonicus StreptococcusStreptococcaceae 18.2^(#&) 0.0 Enterobacter cloacae EnterobacterEnterobacteriaceae 0.4 0.0 Klebsiella oxytoca KlebsiellaEnterobacteriaceae 0.7 0.0 Thermoanaerobacterium ThermoanaerobacteriumThermoanaerobacterales 0.2 0.0 thermosaccharolyticum Family III.Incertae Sedis Streptococcus sp. VT 162 Streptococcus Streptococcaceae0.6 0.0 Enterococcus faecium Enterococcus Enterococcaceae 0.1 0.1Pseudomonas protegens Pseudomonas Pseudomonadaceae 3.0 0.1 Micrococcusluteus Micrococcus Micrococcaceae 0.0 0.0 Streptococcus infantariusStreptococcus Streptococcaceae 0.5 0.0 Staphylococcus epidermidisStaphylococcus Staphylococcaceae 0.0 0.0 Enterobacter asburiaeEnterobacter Enterobacteriaceae 0.2 0.0 Cupriavidus metalliduransCupriavidus Burkholderiaceae 0.0 0.0 Pseudomonas putida PseudomonasPseudomonadaceae 0.1 0.0 Enterococcus casseliflavus EnterococcusEnterococcaceae 0.1 0.1 Streptococcus mitis StreptococcusStreptococcaceae 0.6 0.0 Streptococcus equi StreptococcusStreptococcaceae 5.2 0.0 Salmonella enterica SalmonellaEnterobacteriaceae 4.9 0.0 Streptococcus oralis StreptococcusStreptococcaceae 0.5 0.0 Klebsiella variicola KlebsiellaEnterobacteriaceae 4.3 0.0 Burkholderia lata BurkholderiaBurkholderiaceae 0.3 0.0 Pseudomonas stutzeri PseudomonasPseudomonadaceae 0.0 0.0 Streptococcus pneumoniae StreptococcusStreptococcaceae 0.4 0.0 Streptococcus dysgalactiae StreptococcusStreptococcaceae 3.7 0.0 * Burkholderia Burkholderiaceae 0.1 0.0Pseudomonas fluorescens Pseudomonas Pseudomonadaceae 0.4 0.0Acinetobacter guillouiae Acinetobacter Moraxellaceae 0.1 0.0 Veillonellaparvula Veillonella Veillonellaceae 0.1 0.0 Xanthomonas campestrisXanthomonas Xanthomonadaceae 0.1 0.1 Exiguobacterium sp. AT1bExiguobacterium 0.1 0.0 Pseudomonas Pseudomonas Pseudomonadaceae 0.0 0.0pseudoalcaligenes Streptococcus pasteurianus StreptococcusStreptococcaceae 2.8 0.0 Rothia dentocariosa Rothia Micrococcaceae 0.10.0 * * Streptococcaceae 2.6 0.0 Delftia acidovorans DelftiaComamonadaceae 0.5 0.0 * Propionibacterium Propionibacteriaceae 0.0 0.0Acinetobacter baumannii Acinetobacter Moraxellaceae 0.0 0.0Streptococcus parasanguinis Streptococcus Streptococcaceae 0.0 0.0Pseudomonas sp. WCS374 Pseudomonas Pseudomonadaceae 0.8 0.0 Enterobacteraerogenes Enterobacter Enterobacteriaceae 2.3 0.0 StenotrophomonasStenotrophomonas Xanthomonadaceae 0.2 0.0 maltophilia AlicyclobacillusAlicyclobacillus Alicyclobacillaceae 0.2 0.0 acidocaldarius Haemophilusinfluenzae Haemophilus Pasteurellaceae 0.1 0.0 Rothia mucilaginosaRothia Micrococcaceae 0.0 0.0 Staphylococcus xylosus StaphylococcusStaphylococcaceae 0.2 0.0 Acidovorax sp. JS42 Acidovorax Comamonadaceae0.1 0.0 * Rahnella Enterobacteriaceae 1.8 0.0 StreptococcusStreptococcus Streptococcaceae 0.3 0.0 pseudopneumoniae Streptococcusthermophilus Streptococcus Streptococcaceae 0.1 0.0 Pseudomonasaeruginosa Pseudomonas Pseudomonadaceae 0.1 0.0 CorynebacteriumCorynebacterium Corynebacteriaceae 1.6 0.0 kroppenstedtii StaphylococcusStaphylococcus Staphylococcaceae 0.0 0.0 haemolyticus Serratia sp. SCBISerratia Enterobacteriaceae 1.5 0.0 * Enterococcus Enterococcaceae 0.10.0 Staphylococcus warneri Staphylococcus Staphylococcaceae 0.0 0.0Burkholderia cepacia Burkholderia Burkholderiaceae 0.1 0.1 *Corynebacterium Corynebacteriaceae 0.0 0.0 Comamonas testosteroniComamonas Comamonadaceae 0.0 0.0 Bifidobacterium BifidobacteriumBifidobacteriaceae 1.3 0.0 thermophilum * Lactobacillus Lactobacillaceae0.2 0.0 Thermoanaerobacterium ThermoanaerobacteriumThermoanaerobacterales 0.1 0.0 xylanolyticum Family III. Incertae SedisRalstonia pickettii Ralstonia Burkholderiaceae 0.1 0.0 Meiothermus ruberMeiothermus Thermaceae 0.0 0.0 Acidovorax ebreus AcidovoraxComamonadaceae 0.7 0.0 Micrococcus sp. V7 Micrococcus Micrococcaceae 0.20.0 Leuconostoc mesenteroides Leuconostoc Leuconostocaceae 0.0 0.0Bacillus halodurans Bacillus Bacillaceae 0.7 0.0 Corynebacteriumvariabile Corynebacterium Corynebacteriaceae 0.3 0.0 * AcidovoraxComamonadaceae 0.1 0.0 Bifidobacterium bifidum BifidobacteriumBifidobacteriaceae 0.0 0.0 Rhizobium sp. IRBG74 Rhizobium Rhizobiaceae0.0 0.0 * Acinetobacter Moraxellaceae 0.0 0.0 Acinetobactercalcoaceticus Acinetobacter Moraxellaceae 1.0 0.0 Chroococcidiopsisthermalis Chroococcidiopsis 0.2 0.0 Streptococcus sanguinisStreptococcus Streptococcaceae 0.1 0.0 Acidovorax sp. KKS102 AcidovoraxComamonadaceae 0.1 0.0 Raoultella ornithinolytica RaoultellaEnterobacteriaceae 0.1 0.0 Ochrobactrum anthropi OchrobactrumBrucellaceae 0.0 0.0 Lactobacillus johnsonii LactobacillusLactobacillaceae 0.6 0.1 Methylobacterium populi MethylobacteriumMethylobacteriaceae 0.2 0.0 Rhodococcus equi Rhodococcus Nocardiaceae0.1 0.0 Lactobacillus helveticus Lactobacillus Lactobacillaceae 0.90.0 * Serratia Enterobacteriaceae 0.2 0.0 Burkholderia cenocepaciaBurkholderia Burkholderiaceae 0.3 0.0 * Staphylococcus Staphylococcaceae0.0 0.0 Enterobacter sp. R4-368 Enterobacter Enterobacteriaceae 0.8 0.0Propionibacterium Propionibacterium Propionibacteriaceae 0.1 0.0propionicum Streptococcus gordonii Streptococcus Streptococcaceae 0.10.0 Corynebacterium singulare Corynebacterium Corynebacteriaceae 0.1 0.0Burkholderia ambifaria Burkholderia Burkholderiaceae 0.7 0.1 *Micrococcus Micrococcaceae 0.0 0.0 Fervidobacterium nodosumFervidobacterium Thermotogaceae 0.0 0.0 Aeromonas media AeromonasAeromonadaceae 0.0 0.0 Cronobacter sakazakii CronobacterEnterobacteriaceae 0.7 0.0 Myroides profundi Myroides Flavobacteriaceae0.7 0.0 Methylobacterium oryzae Methylobacterium Methylobacteriaceae 0.70.0 * Xanthomonas Xanthomonadaceae 0.3 0.0 ThermoanaerobacteriumThermoanaerobacterium Thermoanaerobacterales 0.1 0.0 saccharolyticumFamily III. Incertae Sedis Pseudomonas mendocina PseudomonasPseudomonadaceae 0.0 0.0 Corynebacterium CorynebacteriumCorynebacteriaceae 0.0 0.0 ureicelerivorans Lactobacillus crispatusLactobacillus Lactobacillaceae 0.0 0.0 Alicycliphilus denitrificansAlicycliphilus Comamonadaceae 0.0 0.0 Gardnerella vaginalis GardnerellaBifidobacteriaceae 0.0 0.0 * Gemella 0.6 0.0 * RalstoniaBurkholderiaceae 0.6 0.0 Eggerthella lenta Eggerthella Coriobacteriaceae0.6 0.0 * * Rhizobiaceae 0.5 0.0 Prevotella denticola PrevotellaPrevotellaceae 0.2 0.0 Prevotella intermedia Prevotella Prevotellaceae0.2 0.0 Psychrobacter sp. PRwf-1 Psychrobacter Moraxellaceae 0.1 0.0Azospira oryzae Azospira Rhodocyclaceae 0.1 0.0 Acinetobacterhaemolyticus Acinetobacter Moraxellaceae 0.1 0.0 * DelftiaComamonadaceae 0.1 0.0 Burkholderia contaminans BurkholderiaBurkholderiaceae 0.4 0.2 Arthrobacter arilaitensis ArthrobacterMicrococcaceae 0.1 0.1 Dermacoccus Dermacoccus Dermacoccaceae 0.1 0.1nishinomiyaensis Pantoea ananatis Pantoea Enterobacteriaceae 0.1 0.0Staphylococcus Staphylococcus Staphylococcaceae 0.0 0.0 saprophyticusStaphylococcus pasteuri Staphylococcus Staphylococcaceae 0.0 0.0Rahnella aquatilis Rahnella Enterobacteriaceae 0.5 0.0 Rahnella sp.Y9602 Rahnella Enterobacteriaceae 0.5 0.0 Campylobacter concisusCampylobacter Campylobacteraceae 0.5 0.0 Geobacillus sp. WCH70Geobacillus Bacillaceae 0.5 0.0 * Frankia Frankiaceae 0.5 0.0Lactobacillus casei Lactobacillus Lactobacillaceae 0.5 0.0 Thiomonasintermedia Thiomonas 0.5 0.0 Streptococcus gallolyticus StreptococcusStreptococcaceae 0.5 0.0 Thioalkalivibrio sulfidiphilus ThioalkalivibrioEctothiorhodospiraceae 0.3 0.0 * Bradyrhizobium Bradyrhizobiaceae 0.20.0 Bifidobacterium longum Bifidobacterium Bifidobacteriaceae 0.1 0.0Corynebacterium falsenii Corynebacterium Corynebacteriaceae 0.1 0.0Delftia sp. Cs1-4 Delftia Comamonadaceae 0.1 0.0 Acinetobacter sp. M131Acinetobacter Moraxellaceae 0.0 0.0 Prevotella melaninogenica PrevotellaPrevotellaceae 0.0 0.0 * * Comamonadaceae 0.0 0.0 Leuconostoc carnosumLeuconostoc Leuconostocaceae 0.2 0.0 Pectobacterium carotovorumPectobacterium Enterobacteriaceae 0.0 0.0 * Myroides Flavobacteriaceae0.4 0.0 * Erwinia Enterobacteriaceae 0.4 0.0 Gordonia sp. KTR9 GordoniaGordoniaceae 0.4 0.0 Paenibacillus sp. FSL R7- PaenibacillusPaenibacillaceae 0.4 0.0 0273 Paracoccus sp. N81106 ParacoccusRhodobacteraceae 0.4 0.0 Sphingobium fuliginis SphingobiumSphingomonadaceae 0.4 0.0 * * 0.4 0.0 * Geobacillus Bacillaceae 0.2 0.0Pseudomonas sp. VLB120 Pseudomonas Pseudomonadaceae 0.2 0.0Pelagibacterium halotolerans Pelagibacterium Hyphomicrobiaceae 0.1 0.0Streptococcus intermedius Streptococcus Streptococcaceae 0.1 0.0Propionibacterium Propionibacterium Propionibacteriaceae 0.1 0.0freudenreichii * Enterobacter Enterobacteriaceae 0.0 0.0 Nakamurellamultipartita Nakamurella Nakamurellaceae 0.1 0.1 Haemophilus parasuisHaemophilus Pasteurellaceae 0.3 0.0 Fusobacterium nucleatumFusobacterium Fusobacteriaceae 0.3 0.0 Citrobacter freundii CitrobacterEnterobacteriaceae 0.3 0.0 Ruminococcus sp. SR1/5 RuminococcusRuminococcaceae 0.2 0.0 Pseudoxanthomonas spadix PseudoxanthomonasXanthomonadaceae 0.2 0.0 Lactococcus garvieae LactococcusStreptococcaceae 0.1 0.0 Neisseria elongata Neisseria Neisseriaceae 0.10.0 Acidovorax citrulli Acidovorax Comamonadaceae 0.1 0.0Novosphingobium Novosphingobium Sphingomonadaceae 0.1 0.0pentaromativorans Citrobacter koseri Citrobacter Enterobacteriaceae 0.10.0 Methylobacterium aquaticum Methylobacterium Methylobacteriaceae 0.10.0 Pseudomonas denitrificans Pseudomonas Pseudomonadaceae 0.1 0.0Rhodococcus erythropolis Rhodococcus Nocardiaceae 0.0 0.0 Lactobacillusreuteri Lactobacillus Lactobacillaceae 0.0 0.0 Bacteroides fragilisBacteroides Bacteroidaceae 0.0 0.0 Lactobacillus plantarum LactobacillusLactobacillaceae 0.0 0.0 * Bacillus Bacillaceae 0.0 0.0 PseudomonasPseudomonas Pseudomonadaceae 0.2 0.1 rhizosphaerae Achromobacterxylosoxidans Achromobacter Alcaligenaceae 0.0 0.0 Lactobacillusamylovorus Lactobacillus Lactobacillaceae 0.2 0.0 PropionibacteriumPropionibacterium Propionibacteriaceae 0.2 0.0 acidipropioniciLeuconostoc gelidum Leuconostoc Leuconostocaceae 0.2 0.0 Weissellathailandensis Weissella Leuconostocaceae 0.2 0.0 Pandoraea sp. RB-44Pandoraea Burkholderiaceae 0.2 0.0 Escherichia vulneris EscherichiaEnterobacteriaceae 0.2 0.0 Yersinia intermedia YersiniaEnterobacteriaceae 0.2 0.0 Flavobacteriaceae bacterium Flavobacteriaceae0.2 0.0 3519-10 * Rhodococcus Nocardiaceae 0.2 0.0 Streptococcus sp.(N1) Streptococcus Streptococcaceae 0.2 0.0 * MethylobacteriumMethylobacteriaceae 0.1 0.0 Sphingomonas sp. MM-1 SphingomonasSphingomonadaceae 0.1 0.0 Rhizobium etli Rhizobium Rhizobiaceae 0.10.0 * Agrobacterium Rhizobiaceae 0.0 0.0 Thermus scotoductus ThermusThermaceae 0.0 0.0 Methylobacterium Methylobacterium Methylobacteriaceae0.0 0.0 extorquens Streptococcus sp. I-P16 StreptococcusStreptococcaceae 0.0 0.0 Pantoea sp. PSNIH1 Pantoea Enterobacteriaceae0.0 0.0 Methylobacterium Methylobacterium Methylobacteriaceae 0.0 0.0radiotolerans [Ruminococcus] torques Blautia Lachnospiraceae 0.1 0.4Lactobacillus sakei Lactobacillus Lactobacillaceae 0.1 0.1 * *Rhodocyclaceae 0.1 0.1 Bacillus licheniformis Bacillus Bacillaceae 0.00.1 Corynebacterium accolens Corynebacterium Corynebacteriaceae 0.1 0.0Corynebacterium sp. ATCC Corynebacterium Corynebacteriaceae 0.1 0.0 6931Serratia symbiotica Serratia Enterobacteriaceae 0.1 0.0 Lactobacillusdelbrueckii Lactobacillus Lactobacillaceae 0.1 0.0 Bacillus sp. YP1Bacillus Bacillaceae 0.1 0.0 Klebsiella sp. PG122E KlebsiellaEnterobacteriaceae 0.1 0.0 * Rathayibacter Microbacteriaceae 0.1 0.0Pseudoalteromonas sp. P30 Pseudoalteromonas Pseudoalteromonadaceae 0.10.0 Staphylococcus sp. CDC25 Staphylococcus Staphylococcaceae 0.1 0.0Corynebacterium resistens Corynebacterium Corynebacteriaceae 0.1 0.0Shigella dysenteriae Shigella Enterobacteriaceae 0.1 0.0 * *Xanthomonadaceae 0.1 0.0 Agrobacterium fabrum Agrobacterium Rhizobiaceae0.0 0.0 Gordonia polyisoprenivorans Gordonia Gordoniaceae 0.0 0.0Pseudomonas balearica Pseudomonas Pseudomonadaceae 0.0 0.0 * *Pseudomonadaceae 0.0 0.0 Ruminococcus bromii RuminococcusRuminococcaceae 0.0 0.0 Brachybacterium faecium BrachybacteriumDermabacteraceae 0.0 0.0 Acinetobacter johnsonii AcinetobacterMoraxellaceae 0.0 0.0 Micrococcus sp. A1 Micrococcus Micrococcaceae 0.00.0 Filifactor alocis Filifactor Peptostreptococcaceae 0.0 0.0 Pantoeavagans Pantoea Enterobacteriaceae 0.0 0.0 Haemophilus parainfluenzaeHaemophilus Pasteurellaceae 0.0 0.0 Pantoea rwandensis PantoeaEnterobacteriaceae 0.0 0.3 Corynebacterium CorynebacteriumCorynebacteriaceae 0.0 0.2 vitaeruminis Pseudomonas poae PseudomonasPseudomonadaceae 0.0 0.1 * * Brucellaceae 0.0 0.1 Lactobacillusfermentum Lactobacillus Lactobacillaceae 0.0 0.1 Anabaena variabilisAnabaena Nostocaceae 0.0 0.1 Sphingobacterium sp. ML3W SphingobacteriumSphingobacteriaceae 0.0 0.1 sugarcane isolate 74-1 0.0 0.1 * *Geodermatophilaceae 0.0 0.1 Megasphaera elsdenii MegasphaeraVeillonellaceae 0.0 0.1 Pseudoxanthomonas PseudoxanthomonasXanthomonadaceae 0.0 0.1 suwonensis Corynebacterium CorynebacteriumCorynebacteriaceae 0.0 0.0 glutamicum Sphingomonas taxi SphingomonasSphingomonadaceae 0.0 0.0 Pseudomonas graminis PseudomonasPseudomonadaceae 0.0 0.0 Bradyrhizobium sp. BTAi1 BradyrhizobiumBradyrhizobiaceae 0.0 0.0 Enterococcus hirae EnterococcusEnterococcaceae 0.0 0.0 Corynebacterium sp. L2-79- CorynebacteriumCorynebacteriaceae 0.0 0.0 05 Arthrobacter Arthrobacter Micrococcaceae0.0 0.0 phenanthrenivorans Corynebacterium maris CorynebacteriumCorynebacteriaceae 0.0 0.0 Gordonia bronchialis Gordonia Gordoniaceae0.0 0.0 Kytococcus sedentarius Kytococcus Dermacoccaceae 0.0 0.0Kosakonia cowanii Kosakonia Enterobacteriaceae 0.0 0.0 Xenorhabdusbovienii Xenorhabdus Enterobacteriaceae 0.0 0.0 Paracoccus haeundaensisParacoccus Rhodobacteraceae 0.0 0.0 Methylobacterium sp. 238Methylobacterium Methylobacteriaceae 0.0 0.0 Acinetobacter sp. BW3Acinetobacter Moraxellaceae 0.0 0.0 Aeromonas sobria AeromonasAeromonadaceae 0.0 0.0 Bacillus lehensis Bacillus Bacillaceae 0.0 0.0Ralstonia solanacearum Ralstonia Burkholderiaceae 0.0 0.0 Citrobactersp. FPO3 Citrobacter Enterobacteriaceae 0.0 0.0 Citrobacter sp. I91-3Citrobacter Enterobacteriaceae 0.0 0.0 Erwinia amylovora ErwiniaEnterobacteriaceae 0.0 0.0 Klebsiella milletis KlebsiellaEnterobacteriaceae 0.0 0.0 Salmonella bongori SalmonellaEnterobacteriaceae 0.0 0.0 Serratia grimesii Serratia Enterobacteriaceae0.0 0.0 Yersinia pestis Yersinia Enterobacteriaceae 0.0 0.0 *Enterobacteriaceae 0.0 0.0 Lactobacillus brevis LactobacillusLactobacillaceae 0.0 0.0 Kocuria sp. starX Kocuria Micrococcaceae 0.00.0 Acinetobacter sp. 26 Acinetobacter Moraxellaceae 0.0 0.0Peptoclostridium difficile Peptoclostridium Peptostreptococcaceae 0.00.0 Sorangium cellulosum Sorangium Polyangiaceae 0.0 0.0 Pseudomonas sp.NSi14 Pseudomonas Pseudomonadaceae 0.0 0.0 Synergistetes oral clone 03 50.0 0.0 D05 bacterium EBAD26 0.0 0.0 bacterium NLAE-zI-G351 0.0 0.0rumen bacterium enrichment 0.0 0.0 culture clone Y74 unidentified marine0.0 0.0 bacterioplankton Escherichia albertii EscherichiaEnterobacteriaceae 0.0 0.0 Mobiluncus curtisii MobiluncusActinomycetaceae 0.0 0.0 * Caulobacter Caulobacteraceae 0.0 0.0Methylotenera versatilis Methylotenera Methylophilaceae 0.0 0.0Propionibacterium sp. Propionibacterium Propionibacteriaceae 0.0 0.0KPL1849 bacterium EBAD25 0.0 0.0 Bacillus sp. Pc3 Bacillus Bacillaceae0.0 0.0 Acinetobacter sp. EVA14 Acinetobacter Moraxellaceae 0.0 0.0Agrobacterium sp. Agrobacterium Rhizobiaceae 0.0 0.0 Alistipes shahiiAlistipes Rikenellaceae 0.0 0.0 * Thermus Thermaceae 0.0 0.0 *Methylibium 0.0 0.0 butyrate-producing 0.0 0.0 bacterium SSC/2Escherichia fergusonii Escherichia Enterobacteriaceae 0.0 0.0Enterobacter sp. Ni15 Enterobacter Enterobacteriaceae 0.0 0.0Capnocytophaga ochracea Capnocytophaga Flavobacteriaceae 0.0 0.0 Thauerasp. 6NLG Thauera Rhodocyclaceae 0.0 0.0 Desulfovibrio alaskensisDesulfovibrio Desulfovibrionaceae 0.0 0.0 Variovorax sp. Alb14Variovorax Comamonadaceae 0.0 0.0 * Shigella Enterobacteriaceae 0.00.0 * * Micromonosporaceae 0.0 0.0 * Micromonospora Micromonosporaceae0.0 0.0 Thermobifida fusca Thermobifida Nocardiopsaceae 0.0 0.0Turneriella parva Turneriella Leptospiraceae 0.0 0.0 [Clostridium]sticklandii Peptoclostridium Peptostreptococcaceae 0.0 0.0 Acinetobactersp. Ooi24 Acinetobacter Moraxellaceae 0.0 0.0 Ochrobactrum sp. SJY1Ochrobactrum Brucellaceae 0.0 0.0 Carnobacterium sp. WN1359Carnobacterium Carnobacteriaceae 0.0 0.0 Iamia majanohamensis IamiaIamiaceae 0.0 0.0 Saccharomonospora viridis SaccharomonosporaPseudonocardiaceae 0.0 0.0 Rhizobium sp. Rhizobium Rhizobiaceae 0.0 0.0Staphylococcus sp. CDC3 Staphylococcus Staphylococcaceae 0.0 0.0Shigella sonnei Shigella Enterobacteriaceae 0.0 0.0 Pseudomonas syringaePseudomonas Pseudomonadaceae 0.0 0.0 Burkholderia vietnamiensisBurkholderia Burkholderiaceae 0.0 0.0 Shigella boydii ShigellaEnterobacteriaceae 0.0 0.0 Bacillus weihenstephanensis BacillusBacillaceae 0.0 0.0 Erythrobacter litoralis ErythrobacterErythrobacteraceae 0.0 0.0 Pseudoalteromonas PseudoalteromonasPseudoalteromonadaceae 0.0 0.0 haloplanktis Pseudomonas sp. FGI182Pseudomonas Pseudomonadaceae 0.0 0.0 * Rhizobium Rhizobiaceae 0.0 0.0 *Rickettsia Rickettsiaceae 0.0 0.0 Sphingobium yanoikuyae SphingobiumSphingomonadaceae 0.0 0.0 Stenotrophomonas StenotrophomonasXanthomonadaceae 0.0 0.0 rhizophila * Leuconostoc Leuconostocaceae 0.00.0 Aquincola tertiaricarbonis Aquincola 0.0 0.0 Nocardiopsisdassonvillei Nocardiopsis Nocardiopsaceae 0.0 0.0 CarnobacteriumCarnobacterium Carnobacteriaceae 0.0 0.0 maltaromaticum * HaemophilusPasteurellaceae 0.0 0.0 Bordetella parapertussis BordetellaAlcaligenaceae 0.0 0.0 * Dietzia Dietziaceae 0.0 0.0 Shewanella sp.W3-18-1 Shewanella Shewanellaceae 0.0 0.0 Sphingomonas sp. NP5Sphingomonas Sphingomonadaceae 0.0 0.0 Staphylococcus gallinarumStaphylococcus Staphylococcaceae 0.0 0.0 Micavibrio aeruginosavorusMicavibrio 0.0 0.0 Paracoccus denitrificans Paracoccus Rhodobacteraceae0.0 0.0 [Cellvibrio] gilvus Cellulomonas Cellulomonadaceae 0.0 0.0Corynebacterium jeikeium Corynebacterium Corynebacteriaceae 0.0 0.0 * *Staphylococcaceae 0.0 0.0 Meiothermus silvanus Meiothermus Thermaceae0.0 0.0 Asticcacaulis excentricus Asticcacaulis Caulobacteraceae 0.00.0 * Atopobium Coriobacteriaceae 0.0 0.0 Streptococcus constellatusStreptococcus Streptococcaceae 0.0 0.0 Microcystis aeruginosaMicrocystis 0.0 0.0 agricultural soil bacterium 0.0 0.0 SC-I-13[Ruminococcus] obeum Blautia Lachnospiraceae 0.0 0.0 Thermusthermophilus Thermus Thermaceae 0.0 0.0 Shigella flexneri ShigellaEnterobacteriaceae 0.0 0.0 * Mycobacterium Mycobacteriaceae 0.0 0.0Pseudomonas savastanoi Pseudomonas Pseudomonadaceae 0.0 0.0Staphylococcus capitis Staphylococcus Staphylococcaceae 0.0 0.0 *Cupriavidus Burkholderiaceae 0.0 0.0 Dyadobacter fermentans DyadobacterCytophagaceae 0.0 0.0 Dietzia sp. CQ4 Dietzia Dietziaceae 0.0 0.0 * *Methylophilaceae 0.0 0.0 * Neisseria Neisseriaceae 0.0 0.0Corynebacterium Corynebacterium Corynebacteriaceae 0.0 0.0 aurimucosum *0.0 0.0 Pseudomonas fulva Pseudomonas Pseudomonadaceae 0.0 0.0Chromohalobacter Chromohalobacter Halomonadaceae 0.0 0.0 salexigensBrevundimonas diminuta Brevundimonas Caulobacteraceae 0.0 0.0Streptococcus lutetiensis Streptococcus Streptococcaceae 0.0 0.0Bordetella petrii Bordetella Alcaligenaceae 0.0 0.0 Erythrobacter sp.JP13.1 Erythrobacter Erythrobacteraceae 0.0 0.0 Methylobacillusglycogenes Methylobacillus Methylophilaceae 0.0 0.0 Candidatus RhodolunaCandidatus Microbacteriaceae 0.0 0.0 lacicola Rhodoluna Arthrobacter sp.JBH1 Arthrobacter Micrococcaceae 0.0 0.0 Aggregatibacter aphrophilusAggregatibacter Pasteurellaceae 0.0 0.0 Thauera sp. B4 ThaueraRhodocyclaceae 0.0 0.0 Lysobacter dokdonensis LysobacterXanthomonadaceae 0.0 0.0 Clostridiales genomosp. 0.0 0.0 BVAB3Streptococcus sp. I-G2 Streptococcus Streptococcaceae 0.0 0.0Pseudomonas mandelii Pseudomonas Pseudomonadaceae 0.0 0.0 Bradyrhizobiumsp. S23321 Bradyrhizobium Bradyrhizobiaceae 0.0 0.0 Phenylobacteriumzucineum Phenylobacterium Caulobacteraceae 0.0 0.0 Pseudomonas mosseliiPseudomonas Pseudomonadaceae 0.0 0.0 Staphylococcus lugdunensisStaphylococcus Staphylococcaceae 0.0 0.0 Proteus mirabilis ProteusEnterobacteriaceae 0.0 0.0 * * Neisseriaceae 0.0 0.0 Arthrobacter sp.J3-40 Arthrobacter Micrococcaceae 0.0 0.0 * Pantoea Enterobacteriaceae0.0 0.0 Corynebacterium efficiens Corynebacterium Corynebacteriaceae 0.00.0 * Halomonas Halomonadaceae 0.0 0.0 Trueperella pyogenes TrueperellaActinomycetaceae 0.0 0.0 Streptomyces coelicolor StreptomycesStreptomycetaceae 0.0 0.0 Kocuria rhizophila Kocuria Micrococcaceae 0.00.0 Bacillus cereus Bacillus Bacillaceae 0.0 0.0 Tannerella forsythiaTannerella Porphyromonadaceae 0.0 0.0 * Alkalibacterium Camobacteriaceae0.0 0.0 Atopobium parvulum Atopobium Coriobacteriaceae 0.0 0.0Serinicoccus profundi Serinicoccus Intrasporangiaceae 0.0 0.0 *Leptotrichia Leptotrichiaceae 0.0 0.0 * * Planococcaceae 0.0 0.0Planomicrobium Planomicrobium Planococcaceae 0.0 0.0 okeanokoitesJannaschia sp. CCS1 Jannaschia Rhodobacteraceae 0.0 0.0 Paracoccusaestuarii Paracoccus Rhodobacteraceae 0.0 0.0 Rhodobacter blasticusRhodobacter Rhodobacteraceae 0.0 0.0 Agrobacterium tumefaciensAgrobacterium Rhizobiaceae 0.0 0.0 Shewanella sp. ANA-3 ShewanellaShewanellaceae 0.0 0.0 Pseudomonas cichorii Pseudomonas Pseudomonadaceae0.0 0.0 Halomonas sp. A3H3 Halomonas Halomonadaceae 0.0 0.0 Serratialiquefaciens Serratia Enterobacteriaceae 0.0 0.0 Sphingopyxis alaskensisSphingopyxis Sphingomonadaceae 0.0 0.0 * Brevundimonas Caulobacteraceae0.0 0.0 Deinococcus deserti Deinococcus Deinococcaceae 0.0 0.0Desulfovibrio vulgaris Desulfovibrio Desulfovibrionaceae 0.0 0.0Propionibacterium sp. Propionibacterium Propionibacteriaceae 0.0 0.0NTS31307302 Paracoccus alcaliphilus Paracoccus Rhodobacteraceae 0.0 0.0Vibrio parahaemolyticus Vibrio Vibrionaceae 0.0 0.0 CandidatusSaccharibacteria 0.0 0.0 oral taxon TM7x Geitlerinema sp. PCC 7407Geitlerinema 0.0 0.0 * Actinomyces Actinomycetaceae 0.0 0.0Brevundimonas vesicularis Brevundimonas Caulobacteraceae 0.0 0.0Acinetobacter sp. YS0810 Acinetobacter Moraxellaceae 0.0 0.0 *Prevotella Prevotellaceae 0.0 0.0 Methyloceanibacter Methyloceanibacter0.0 0.0 caenitepidi Leuconostoc citreum Leuconostoc Leuconostocaceae 0.00.0 * Bacteroides Bacteroidaceae 0.0 0.0 Pseudomonas alcaligenesPseudomonas Pseudomonadaceae 0.0 0.0 Methylibium petroleiphilumMethylibium 0.0 0.0 Moraxella catarrhalis Moraxella Moraxellaceae 0.00.0 Sphingopyxis sp. Kp5.2 Sphingopyxis Sphingomonadaceae 0.0 0.0Pandoraea apista Pandoraea Burkholderiaceae 0.0 0.0 * CellulomonasCellulomonadaceae 0.0 0.0 Olsenella uli Olsenella Coriobacteriaceae 0.00.0 Acinetobacter oleivorans Acinetobacter Moraxellaceae 0.0 0.0Sphingomonas sp. 133 Sphingomonas Sphingomonadaceae 0.0 0.0 Meiothermustaiwanensis Meiothermus Thermaceae 0.0 0.0 Deinococcus geothermalisDeinococcus Deinococcaceae 0.0 0.0 Lactobacillus LactobacillusLactobacillaceae 0.0 0.0 sanfranciscensis Acinetobacter sp. LUH5605Acinetobacter Moraxellaceae 0.0 0.0 Staphylococcus simulansStaphylococcus Staphylococcaceae 0.0 0.0 Arsenophonus nasoniaeArsenophonus Enterobacteriaceae 0.0 0.0 Buchnera aphidicola BuchneraEnterobacteriaceae 0.0 0.0 Weissella koreensis WeissellaLeuconostocaceae 0.0 0.0 Psychrobacter sp. G Psychrobacter Moraxellaceae0.0 0.0 Amycolicicoccus subflavus Amycolicicoccus Mycobacteriaceae 0.00.0 Staphylococcus hyicus Staphylococcus Staphylococcaceae 0.0 0.0Morganella morganii Morganella Enterobacteriaceae 0.0 0.0 Kyrpidiatusciae Kyrpidia Alicyclobacillaceae 0.0 0.0 Ramlibacter tataouinensisRamlibacter Comamonadaceae 0.0 0.0 Weeksella virosa WeeksellaFlavobacteriaceae 0.0 0.0 Acinetobacter junii AcinetobacterMoraxellaceae 0.0 0.0 Acinetobacter sp. 26B2 Acinetobacter Moraxellaceae0.0 0.0 Mycobacterium abscessus Mycobacterium Mycobacteriaceae 0.0 0.0Neisseria gonorrhoeae Neisseria Neisseriaceae 0.0 0.0 Sphingomonaswittichii Sphingomonas Sphingomonadaceae 0.0 0.0 Bacteroides doreiBacteroides Bacteroidaceae 0.0 0.0 Corynebacterium CorynebacteriumCorynebacteriaceae 0.0 0.0 halotolerans Pelomonas aquatica PelomonasComamonadaceae 0.0 0.0 Janibacter sp. TYM3221 JanibacterIntrasporangiaceae 0.0 0.0 * Arthrobacter Micrococcaceae 0.0 0.0Mycobacterium gordonae Mycobacterium Mycobacteriaceae 0.0 0.0Pimelobacter simplex Pimelobacter Nocardioidaceae 0.0 0.0 Pseudomonassp. OM2164 Pseudomonas Pseudomonadaceae 0.0 0.0 Streptomyces albusStreptomyces Streptomycetaceae 0.0 0.0 Halomonas halocynthiae HalomonasHalomonadaceae 0.0 0.0 Nitrosomonas sp. AL212 NitrosomonasNitrosomonadaceae 0.0 0.0 Sphingobacterium mizutaii SphingobacteriumSphingobacteriaceae 0.0 0.0 Vibrio cholerae Vibrio Vibrionaceae 0.0 0.0Rickettsia felis Rickettsia Rickettsiaceae 0.0 0.0 * * Moraxellaceae 0.00.0 Bacteroidetes bacterium 0.0 0.0 WD317 Corynebacterium caseiCorynebacterium Corynebacteriaceae 0.0 0.0 Corynebacterium marinumCorynebacterium Corynebacteriaceae 0.0 0.0 * * Flavobacteriaceae 0.0 0.0Caulobacter segnis Caulobacter Caulobacteraceae 0.0 0.0 Lactobacillusgasseri Lactobacillus Lactobacillaceae 0.0 0.0 * Meiothermus Thermaceae0.0 0.0 Rhizobium sp. NT-26 Rhizobium Rhizobiaceae 0.0 0.0 Bacilluscoagulans Bacillus Bacillaceae 0.0 0.0 * Sphingomonas Sphingomonadaceae0.0 0.0 * Brevibacterium Brevibacteriaceae 0.0 0.0 Nitrosomonas europaeaNitrosomonas Nitrosomonadaceae 0.0 0.0 Pseudomonas alkylphenoliaPseudomonas Pseudomonadaceae 0.0 0.0 Terrabacter sp. DBF63 TerrabacterIntrasporangiaceae 0.0 0.0 beta proteobacterium CB 0.0 0.0 Moraxellaovis Moraxella Moraxellaceae 0.0 0.0 Shewanella baltica ShewanellaShewanellaceae 0.0 0.0 Mycobacterium gilvum MycobacteriumMycobacteriaceae 0.0 0.0 * Exiguobacterium 0.0 0.0 * OchrobactrumBrucellaceae 0.0 0.0 Geodermatophilus obscurus GeodermatophilusGeodermatophilaceae 0.0 0.0 * Devosia Hyphomicrobiaceae 0.0 0.0Moraxella osloensis Moraxella Moraxellaceae 0.0 0.0 Exiguobacterium sp.11-28 Exiguobacterium 0.0 0.0 Nocardioides sp. JS614 NocardioidesNocardioidaceae 0.0 0.0 Nocardioides sp. USM2 NocardioidesNocardioidaceae 0.0 0.0 Burkholderia gladioli BurkholderiaBurkholderiaceae 0.0 0.0 Renibacterium Renibacterium Micrococcaceae 0.00.0 salmoninarum Pseudomonas syringae Pseudomonas Pseudomonadaceae 0.00.0 group genomosp. 3 Bifidobacterium Bifidobacterium Bifidobacteriaceae0.0 0.0 pseudolongum toluene-degrading bacterium 0.0 0.0 UCR 021tCorynebacterium imitans Corynebacterium Corynebacteriaceae 0.0 0.0Corynebacterium callunae Corynebacterium Corynebacteriaceae 0.0 0.0Bosea sp. WAO Bosea Bradyrhizobiaceae 0.0 0.0 Xanthobacter autotrophicusXanthobacter Xanthobacteraceae 0.0 0.0 Corynebacterium diphtheriaeCorynebacterium Corynebacteriaceae 0.0 0.0 Bacteroides thetaiotaomicronBacteroides Bacteroidaceae 0.0 0.0 Caulobacter vibrioides CaulobacterCaulobacteraceae 0.0 0.0 * * Pasteurellaceae 0.0 0.0 Finegoldia magnaFinegoldia Peptoniphilaceae 0.0 0.0 Anaerococcus prevotii AnaerococcusPeptoniphilaceae 0.0 0.0 Azorhizobium caulinodans AzorhizobiumXanthobacteraceae 0.0 0.0 Erysipelothrix rhusiopathiae ErysipelothrixErysipelotrichaceae 0.0 0.0 Porphyromonas PorphyromonasPorphyromonadaceae 0.0 0.0 asaccharolytica * SphingopyxisSphingomonadaceae 0.0 0.0 Eubacterium rectale Eubacterium Eubacteriaceae0.0 0.0 Acinetobacter venetianus Acinetobacter Moraxellaceae 0.0 0.0Variovorax paradoxus Variovorax Comamonadaceae 0.0 0.0 Acinetobacter sp.ED45-25 Acinetobacter Moraxellaceae 0.0 0.0 BradyrhizobiumBradyrhizobium Bradyrhizobiaceae 0.0 0.0 diazoefficiens Bradyrhizobiumjaponicum Bradyrhizobium Bradyrhizobiaceae 0.0 0.0 Megamonas hypermegaleMegamonas Veillonellaceae 0.0 0.0 * Methylobacillus Methylophilaceae 0.00.0 Nocardiopsis alba Nocardiopsis Nocardiopsaceae 0.0 0.0 Modestobactermarinus Modestobacter Geodermatophilaceae 0.0 0.0 CorynebacteriumCorynebacterium Corynebacteriaceae 0.0 0.0 doosanense Blastococcussaxobsidens Blastococcus Geodermatophilaceae 0.0 0.0 Anoxybacillusflavithermus Anoxybacillus Bacillaceae 0.0 0.0 Aeromonas caviaeAeromonas Aeromonadaceae 0.0 0.0 Bacillus subtilis Bacillus Bacillaceae0.0 0.0 Elizabethkingia anophelis Elizabethkingia Flavobacteriaceae 0.00.0 Staphylococcus hominis Staphylococcus Staphylococcaceae 0.0 0.0Ruminococcus bicirculans Ruminococcus Ruminococcaceae 0.0 0.0 Paracoccusmarcusii Paracoccus Rhodobacteraceae 0.0 0.0 * PsychrobacterMoraxellaceae 0.0 0.0 Sphingomonas Sphingomonas Sphingomonadaceae 0.00.0 sanxanigenens * * Leuconostocaceae 0.0 0.0 * * Burkholderiaceae 0.00.0 Bacteroides vulgatus Bacteroides Bacteroidaceae 0.0 0.0Rhodopseudomonas Rhodopseudomonas Bradyrhizobiaceae 0.0 0.0 palustrisPantoea agglomerans Pantoea Enterobacteriaceae 0.0 0.0 * *Sphingomonadaceae 0.0 0.0 Mycobacterium kansasii MycobacteriumMycobacteriaceae 0.0 0.0 * Streptomyces Streptomycetaceae 0.0 0.0Enterococcus faecalis Enterococcus Enterococcaceae 0.0 0.0 Acinetobactersp. NFM2 Acinetobacter Moraxellaceae 0.0 0.0 Shewanella putrefaciensShewanella Shewanellaceae 0.0 0.0 Bifidobacterium adolescentisBifidobacterium Bifidobacteriaceae 0.0 0.0 * BifidobacteriumBifidobacteriaceae 0.0 0.0 Porphyromonas gingivalis PorphyromonasPorphyromonadaceae 0.0 0.0 Neisseria meningitidis NeisseriaNeisseriaceae 0.0 0.0 Rhodococcus pyridinivorans RhodococcusNocardiaceae 0.0 0.0 Aeromonas salmonicida Aeromonas Aeromonadaceae 0.00.0 Planococcus sp. PAMC Planococcus Planococcaceae 0.0 0.0 21323Pseudomonas simiae Pseudomonas Pseudomonadaceae 0.0 0.0 Faecalibacteriumprausnitzii Faecalibacterium Ruminococcaceae 0.0 0.0 Acinetobacterlwoffii Acinetobacter Moraxellaceae 0.0 0.0 Exiguobacterium sp. N139Exiguobacterium 0.0 0.0 Streptococcus anginosus StreptococcusStreptococcaceae 0.0 0.0 Thauera sp. MZ1T Thauera Rhodocyclaceae 0.00.0 * Shewanella Shewanellaceae 0.0 0.0 * Aeromonas Aeromonadaceae 0.00.0 Staphylococcus aureus Staphylococcus Staphylococcaceae 0.0 0.0Aeromonas hydrophila Aeromonas Aeromonadaceae 0.0 0.0 Aeromonas veroniiAeromonas Aeromonadaceae 0.0 0.0 Bifidobacterium breve BifidobacteriumBifidobacteriaceae 0.0 0.0 Bacillus megaterium Bacillus Bacillaceae 0.00.0

Supplementary Table 4 Normalized Reads Per million DNA Patient genusfamily DNA PC CSF Filobasidiella Tremellaceae 10,696.4^(#) 0.0Aspergillus Aspergillaceae 1567.5# 0.0 Toxoplasma Sarcocystidae 947.9#0.0 Filobasidiella Tremellaceae 53.3# 0.0 * Sarcocystidae 27.3# 0.0Neospora Sarcocystidae 5.1 0.0 Hammondia Sarcocystidae 2.5 0.0 *Aspergillaceae 2.1 0.0 Hammondia Sarcocystidae 2.1 0.0 AspergillusAspergillaceae 2.0 0.0 Hammondia Sarcocystidae 1.6 0.0 AspergillusAspergillaceae 1.5 0.0 Aspergillus Aspergillaceae 0.7 0.0 BartheletiaBartheletiaceae 0.7 0.0 Filobasidiella Tremellaceae 0.7 0.0 AspergillusAspergillaceae 0.6 0.0 Spirometra Diphyllobothriidae 0.5 0.2 MucorMucoraceae 0.5 0.0 Aspergillus Aspergillaceae 0.4 0.0 SetosphaeriaPleosporaceae 0.4 0.0 Caenorhabditis Rhabditidae 0.3 0.0 AspergillusAspergillaceae 0.3 0.0 Anisakis Anisakidae 0.2 0.0 PenicilliumAspergillaceae 0.1 0.0 Plectosphaerella Plectosphaerellaceae 0.1 0.0 *Tremellaceae 0.1 0.0 Malassezia Malasseziaceae 0.1 0.0 GongylonemaGongylonematidae 0.1 0.0 Wallemia 0.1 0.0 * * 0.1 0.0 CandidaDebaryomycetaceae 0.1 0.0 Elaeophora Onchocercidae 0.1 0.0 LichtheimiaLichtheimiaceae 0.0 0.0 Brugia Onchocercidae 0.0 0.0 ParastrongyloidesStrongyloididae 0.0 0.0 Sordaria Sordariaceae 0.0 0.1 BipolarisPleosporaceae 0.0 0.0 Malassezia Malasseziaceae 0.0 0.0 CladosporiumCladosporiaceae 0.0 0.0 Strongylocentrotus Strongylocentrotidae 0.0 0.0Caenorhabditis Rhabditidae 0.0 0.0 Penicillium Aspergillaceae 0.0 0.0Fusarium Nectriaceae 0.0 0.0 Leptosphaeria Leptosphaeriaceae 0.0 0.0Chaetomium Chaetomiaceae 0.0 0.0 Myceliophthora Chaetomiaceae 0.0 0.0Alternaria Pleosporaceae 0.0 0.0 Saccharomyces Saccharomycetaceae 0.00.0 Aphanomyces Saprolegniaceae 0.0 0.0 Ophiostoma Ophiostomataceae 0.00.0 Rhodotorula 0.0 0.0 Trichosporon 0.0 0.0 Phoma Didymellaceae 0.0 0.0Erysiphe Erysiphaceae 0.0 0.0 Leptosphaeria Leptosphaeriaceae 0.0 0.0Zymoseptoria Mycosphaerellaceae 0.0 0.0 Debaryomyces Debaryomycetaceae0.0 0.0 Alternaria Pleosporaceae 0.0 0.0 Phoma Didymellaceae 0.0 0.0Umbilicaria Umbilicariaceae 0.0 0.0 Penicillium Aspergillaceae 0.0 0.0Wuchereria Onchocercidae 0.0 0.0 Cladosporium Cladosporiaceae 0.0 0.0Saccharomyces Saccharomycetaceae 0.0 0.0 Fusarium Nectriaceae 0.0 0.0Dasybranchus Capitellidae 0.0 0.0 Neofusicoccum Botryosphaeriaceae 0.00.0 Cladosporium Cladosporiaceae 0.0 0.0 Penicillium Aspergillaceae 0.00.0 Ustilago Ustilaginaceae 0.0 0.0 Mucor Mucoraceae 0.0 0.0 AlbugoAlbuginaceae 0.0 0.0

What is claimed is:
 1. A method for identifying one or more pathogens ina sample of biological material, the method comprising, at a computersystem: receiving a plurality of sequence reads obtained from asequencing of DNA molecules or reverse transcribed RNA molecules (cDNA)from the sample of biological material, the sample including DNA or cDNAmolecules from a plurality of organisms; using an alignment technique toalign the plurality of sequence reads to a plurality of classifiedreference genomes in a database, thereby obtaining alignment resultsthat include, for each of at least a portion of the sequence reads, atleast one matching reference genome to which the sequence read aligns,thereby obtaining matching reference genomes, and wherein the alignmentresults include classification information for each of the matchingreference genomes, the classification information including a pluralityof taxonomy identifiers; identifying a set of the sequence reads thataligned to two or more of the classified reference genomes with at aleast a minimum alignment threshold; for each sequence read of the setof the sequence reads: identifying two or more matching referencegenomes of the classified reference genomes to which the sequence readaligns with at least the minimum alignment threshold; assigning ataxonomy identifier from the classification information to each of thetwo or more of the classified reference genomes, thereby obtainingassigned taxonomy identifiers, each taxonomy identifier including atleast two levels of classification, the at least two levels having ahierarchy such that there is a lower level and at least one higherlevel; comparing the assigned taxonomy identifiers of the two or moreclassified reference genomes at each of the at least two levels ofclassification; removing each level of the at least two levels from theassigned taxonomy identifiers that do not match between the two or moreclassified reference genomes; and assigning to the sequence read alowest level of the at least two levels of the assigned taxonomyidentifiers that is shared between the two or more classified referencegenomes; and providing an identification of one or more taxonomyidentifiers corresponding to one or more candidate pathogens based onnumbers of corresponding sequence reads assigned to each of theplurality of taxonomy identifiers.
 2. The method of claim 1, furthercomprising: when none of the at least two levels of the taxonomyidentifiers match, associating an unassigned state with the sequenceread.
 3. The method of claim 1, wherein providing the identification ofthe one or more taxonomy identifiers corresponding to the one or morecandidate pathogens includes providing an amount of correspondingsequence reads.
 4. The method of claim 1, further comprising: usingsequence reads having an assigned taxonomy identifier to identify theone or more taxonomy identifiers corresponding to the one or morecandidate pathogens by: for each of a set of taxonomy identifiers:determining a total score based on at least a coverage of a referencegenome corresponding to the taxonomy identifier, thereby obtaining totalscores for the set of taxonomy identifiers; ranking the total scores;and identifying the one or more taxonomy identifiers exceeding a scorethreshold.
 5. The method of claim 4, wherein the total score isdetermined further based on: a length of the reference genomecorresponding to the taxonomy identifier, and an identity scorecorresponding to a fractional sum over genomic positions in thereference genome, the fractional sum of a fraction of sequence readshaving an exact match at a genomic position relative to all sequencereads covering the genomic position.
 6. The method of claim 5, whereinthe total score for a taxonomy identifier is determined as (thelength+the identity score)*a percent identity score, wherein the percentidentity score corresponds to the identity score divided by a number ofgenomic positions covered by a sequence read aligned to the referencegenome.
 7. The method of claim 1, wherein the sample of biologicalmaterial is from a host organism, the method further comprising:performing a clinical intervention for the host organism based on theidentification of the one or more taxonomy identifiers corresponding tothe one or more candidate pathogens.
 8. The method of claim 1, furthercomprising: preprocessing the plurality of sequence reads to perform atleast one of: remove sequence reads identified as low quality sequencereads, and remove sequence reads identified as low complexity sequencereads.
 9. The method of claim 1, wherein the classified referencegenomes in the database include reference genomes of microorganisms, themethod further comprising: prior to aligning the plurality of sequencereads to the classified reference genomes in the database: aligning theplurality of sequence reads to a reference human genome; and removing aportion of the plurality of sequence reads that align to the referencehuman genome.
 10. The method of claim 9, wherein the classifiedreference genomes in the database include bacterial genomes, fungalgenomes, and viral genomes.
 11. The method of claim 1, furthercomprising: sequencing DNA molecules from the sample of biologicalmaterial to obtain at least one million sequence reads, wherein thealignment technique aligns the at least one million sequence reads. 12.The method of claim 1, further comprising filtering the alignmentresults that are initial alignments results by: for each matchingreference genome of a set of matching reference genomes: based on theinitial alignment results, identifying an optimally-aligning sequenceread that aligns to the matching reference genome with an optimalalignment score that exceeds or is equal to alignments scores of othersequence reads that align to the matching reference genome, therebyobtaining optimally-aligning sequence reads; for each of theoptimally-aligning sequence reads: applying a second alignment techniquefor the optimally-aligning sequence read to the classified referencegenomes to obtain a plurality of new alignment scores; determiningwhether any of the new alignment scores exceeds the optimal alignmentscore for the optimally-aligning sequence read; when one new alignmentscore exceeds the optimal alignment score for the optimally-aligningsequence read, removing the matching reference genome from the set ofmatching reference genomes.
 13. A method for identifying pathogens in atest sample of biological material, the method comprising, at a computersystem: receiving a plurality of sequence reads obtained from asequencing of DNA molecules from the test sample of biological material,the test sample including DNA molecules from a plurality of organisms;using an alignment technique to align the plurality of sequence reads toa plurality of classified reference genomes in a reference database,thereby obtaining alignment results that include, for each of at least aportion of the sequence reads, at least one matching reference genome towhich the sequence read aligns; for each matching reference genome of agroup of one or more matching reference genomes: determining a firstamount of sequence reads from the test sample that align to the matchingreference genome; determining a second amount of sequence reads from anegative control sample that align to the matching reference genome;determining a ratio of the first amount of sequence reads and the secondamount of sequence reads; comparing the ratio to a threshold todetermine whether the ratio exceeds the threshold, wherein a set of oneor more matching reference genomes have the ratio exceeding thethreshold; and providing an output that identifies the set of one ormore matching reference genomes as potential pathogens in the testsample.
 14. The method of claim 13, further comprising: performing thesequencing of DNA or cDNA molecules from the test sample of biologicalmaterial; and sequencing DNA or cDNA molecules from the negative controlsample, wherein a sequencing of the negative control sample is performedin parallel with the sequencing of the test sample.
 15. The method ofclaim 13, wherein the second amount of sequence reads from the negativecontrol sample is retrieved from a contaminant database.
 16. The methodof claim 15, wherein an amount of sequence reads for a reference genomein the contaminant database is reduced or removed after a specified timeperiod with the reference genome not being detected in any one of aplurality of negative control samples.
 17. The method of claim 13,wherein the threshold is
 10. 18. The method of claim 13, furthercomprising: assigning each of the plurality of sequence reads aligned tothe at least one matching reference genomes to a single classificationlevel in a taxonomy, wherein the first amount of sequence reads and thesecond amount of sequence reads are determined for one or moreclassification levels of the matching reference genome.
 19. The methodof claim 13, further comprising: preparing the negative control samplein parallel with preparing the test sample, the test sample beingprepared by purposefully adding DNA or cDNA from a patient sample,wherein preparation of the negative control sample does not includepurposefully adding DNA or cDNA; performing the sequencing of DNA orcDNA molecules from the test sample; and sequencing any DNA or cDNAmolecules from the negative control sample.
 20. The method of claim 19,wherein the negative control sample is a buffer used in performing PCRas part of the sequencing of DNA or cDNA molecules from the test sample.21. A method for identifying pathogens in a sample of biologicalmaterial, the method comprising, at a computer system: receiving aplurality of sequence reads obtained from a sequencing of DNA moleculesfrom the sample of biological material, the sample including DNA or cDNAmolecules from a plurality of organisms; using an initial alignmenttechnique to align the plurality of sequence reads to a plurality ofreference genomes in a database, thereby obtaining initial alignmentresults that include, for each of at least a portion of the sequencereads, a matching reference genome to which the sequence read aligns;for each of a set of matching reference genomes: based on the initialalignment results, identifying an optimally-aligning sequence read thataligns to the matching reference genome with an optimal alignment scorethat exceeds or is equal to alignments scores of other sequence readsthat align to the matching reference genome, thereby obtainingoptimally-aligning sequence reads; for each of the optimally-aligningsequence reads: applying a second alignment technique for theoptimally-aligning sequence read to the plurality of classifiedreference genomes to obtain a plurality of new alignment scores;determining whether any of the new alignment scores exceeds the optimalalignment score for the optimally-aligning sequence read; based on onenew alignment score exceeding the optimal alignment score for theoptimally-aligning sequence read, removing the matching reference genomefrom the set of matching reference genomes; and providing the set ofmatching reference genomes.
 22. The method of claim 21, wherein thematching reference genome is removed from the set of matching referencegenomes further based on whether the matching reference genome shares asame taxonomic level as a reference genome corresponding to the one newalignment score.
 23. The method of claim 21, further comprising:providing the initial alignment results corresponding to the set ofmatching reference genomes.
 24. The method of claim 21, wherein theinitial alignment technique uses a global alignment algorithm and thesecond alignment technique uses a local alignment algorithm.
 25. Themethod of claim 24, wherein the second alignment technique is at least1,000 times slower than the initial alignment technique.